Project 1, MIE1626 Winter 2022

Data Science Methods and Quantitative Analysis - Home Credit Default Risk [90 Marks]

Deadline: February 11 at 21:00

Academic Integrity

This project is individual: It is to be completed on your own. Do not share your code with others, or post any parts of your work online. You can only submit code that you have written yourself. If you use any online resource for developing parts of your code, acknowledge the source in a comment in your code. Students suspected of plagiarism on a project will be referred to the university for formal discipline according to the Code of Behaviour on Academic Matters.

Please fill out the following:

  • Name: Tianyu Shi
  • Student Number: 1007612264
  • UTorID: shitia26

Part 1: Getting started [20 Marks]

  • Load the data

Load the six csv files as pandas dataframes, using a string 'path' for the location of the files on your system (to be updated by the marker for evaluation).

In [1]:
# Import necessary packages
import pandas as pd
import numpy as np
import math
from varclushi import VarClusHi
from sklearn import feature_selection
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold, train_test_split, RandomizedSearchCV, KFold, GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score, roc_auc_score, plot_confusion_matrix, plot_roc_curve, f1_score, precision_score, recall_score, auc, roc_curve, confusion_matrix, classification_report

import os
import gc

import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

from xgboost import XGBClassifier, plot_importance

import sys
In [3]:
!pwd
/home/tianyu/code/project1

Load the datasets

In [4]:
### YOUR CODE HERE ###
# Use a single string 'path' for the data location so the marker can update it easily
path = 'data/home-credit-default-risk/'

application_train = pd.read_csv(path + 'application_train.csv')
bureau = pd.read_csv(path + 'bureau.csv')
bureau_balance = pd.read_csv(path + 'bureau_balance.csv')
POS_CASH_balance = pd.read_csv(path + 'POS_CASH_balance.csv')
credit_card_balance = pd.read_csv(path + 'credit_card_balance.csv')
installments_payments = pd.read_csv(path + 'installments_payments.csv')

print('shape of application_train',application_train.shape)
print('shape of bureau',bureau.shape)
print('shape of bureau_balance',bureau_balance.shape)
print('shape of POS_CASH_balance',POS_CASH_balance.shape)
print('shape of credit_card_balance',credit_card_balance.shape)
print('shape of installments_payments',installments_payments.shape)
shape of application_train (307511, 122)
shape of bureau (1716428, 17)
shape of bureau_balance (27299925, 3)
shape of POS_CASH_balance (10001358, 8)
shape of credit_card_balance (3840312, 23)
shape of installments_payments (13605401, 8)

1.1 Data transformation and cleaning [15 Marks]

To conduct data cleaning, the first step is to create and select relevant features. This includes the following sub-steps:

  1. Remove/disregard features that have a constant value.
  2. Remove/disregard features that are irrelevant to credit risk or cannot be explained. This includes time variables that relate only to the application and other variables unrelated to the response.
  3. Handle missing values.
  4. For dichotomous categorical variables, calculate the proportion of each outcome to transform them into continuous variables (explained below).
  5. Create new variables.

See details in the project handout.
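As a rough illustration of sub-steps 1 and 4, here is a minimal sketch on a toy frame (the frame and column names are hypothetical, not from the Home Credit data):

```python
import pandas as pd

toy = pd.DataFrame({
    'id':     [1, 1, 2, 2, 2],
    'const':  ['a', 'a', 'a', 'a', 'a'],   # constant value -> drop (sub-step 1)
    'status': ['0', 'C', '0', 'X', '0'],   # categorical outcome (sub-step 4)
})

# Sub-step 1: disregard features with a constant value
toy = toy.loc[:, toy.nunique() > 1]

# Sub-step 4: replace the categorical with the per-id proportion of one outcome,
# turning it into a continuous feature
toy['prop_0'] = (toy['status'] == '0').groupby(toy['id']).transform('mean')
```

This is the same idea as the count/sum/merge computations used below; `transform` simply broadcasts the group proportion back to every row in one step.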

bureau_balance data set

In [5]:
bureau_balance
Out[5]:
SK_ID_BUREAU MONTHS_BALANCE STATUS
0 5715448 0 C
1 5715448 -1 C
2 5715448 -2 C
3 5715448 -3 C
4 5715448 -4 C
... ... ... ...
27299920 5041336 -47 X
27299921 5041336 -48 X
27299922 5041336 -49 X
27299923 5041336 -50 X
27299924 5041336 -51 X

27299925 rows × 3 columns

In [6]:
# Overall proportion of '0' in the variable STATUS across the whole dataset (sanity check)
proportion_of_0 = (bureau_balance['STATUS'].values == '0').sum() / len(bureau_balance['STATUS'])
print('proportion_of_0 in the total dataset:', proportion_of_0)
proportion_of_0 in the total dataset: 0.2747079708094436
In [7]:
def status_to_numeric_0(x):
    # 0 if STATUS equals '0', else 1 (so the group sum counts non-zero statuses)
    if x == '0':
        return 0
    else:
        return 1
In [8]:
# For each SK_ID_BUREAU, calculate the proportion of '0' in STATUS.
# 1. Some IDs contain no '0' at all, so computing the proportion directly would
#    yield NaN; mapping STATUS to a numeric indicator first avoids this.
# 2. sum / count gives the non-zero proportion, so the zero proportion is
#    1 - (non-zero proportion).
bureau_balance['0_or_not'] = bureau_balance['STATUS'].apply(status_to_numeric_0)
a = bureau_balance.groupby('SK_ID_BUREAU')['STATUS'].count().rename('status_total').reset_index()
b = bureau_balance.groupby('SK_ID_BUREAU')['0_or_not'].sum().rename('status_sum').reset_index()
c = pd.merge(a, b, on='SK_ID_BUREAU')
c['0_prop'] = 1 - c['status_sum'] / c['status_total']
bureau_balance = pd.merge(c, bureau_balance, on='SK_ID_BUREAU')
bureau_balance
Out[8]:
SK_ID_BUREAU status_total status_sum 0_prop MONTHS_BALANCE STATUS 0_or_not
0 5001709 97 97 0.000000 0 C 1
1 5001709 97 97 0.000000 -1 C 1
2 5001709 97 97 0.000000 -2 C 1
3 5001709 97 97 0.000000 -3 C 1
4 5001709 97 97 0.000000 -4 C 1
... ... ... ... ... ... ... ...
27299920 6842888 62 53 0.145161 -57 0 0
27299921 6842888 62 53 0.145161 -58 1 1
27299922 6842888 62 53 0.145161 -59 0 0
27299923 6842888 62 53 0.145161 -60 0 0
27299924 6842888 62 53 0.145161 -61 0 0

27299925 rows × 7 columns

In [9]:
# print(bureau_balance.loc[bureau_balance['SK_ID_BUREAU']==5715448])  # sanity check of the calculation
In [10]:
# Drop the intermediate helper columns; MONTHS_BALANCE is kept here and is
# averaged per client in the later aggregation step
bureau_balance_d = bureau_balance.drop(['status_sum', 'status_total', '0_or_not'], axis=1)

bureau_balance_d
Out[10]:
SK_ID_BUREAU 0_prop MONTHS_BALANCE STATUS
0 5001709 0.000000 0 C
1 5001709 0.000000 -1 C
2 5001709 0.000000 -2 C
3 5001709 0.000000 -3 C
4 5001709 0.000000 -4 C
... ... ... ... ...
27299920 6842888 0.145161 -57 0
27299921 6842888 0.145161 -58 1
27299922 6842888 0.145161 -59 0
27299923 6842888 0.145161 -60 0
27299924 6842888 0.145161 -61 0

27299925 rows × 4 columns

In [11]:
# Merge bureau_balance into bureau on the key SK_ID_BUREAU
bureau_combine = bureau.merge(bureau_balance_d, on='SK_ID_BUREAU')
In [12]:
bureau_combine.columns,bureau.columns
Out[12]:
(Index(['SK_ID_CURR', 'SK_ID_BUREAU', 'CREDIT_ACTIVE', 'CREDIT_CURRENCY',
        'DAYS_CREDIT', 'CREDIT_DAY_OVERDUE', 'DAYS_CREDIT_ENDDATE',
        'DAYS_ENDDATE_FACT', 'AMT_CREDIT_MAX_OVERDUE', 'CNT_CREDIT_PROLONG',
        'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT',
        'AMT_CREDIT_SUM_OVERDUE', 'CREDIT_TYPE', 'DAYS_CREDIT_UPDATE',
        'AMT_ANNUITY', '0_prop', 'MONTHS_BALANCE', 'STATUS'],
       dtype='object'),
 Index(['SK_ID_CURR', 'SK_ID_BUREAU', 'CREDIT_ACTIVE', 'CREDIT_CURRENCY',
        'DAYS_CREDIT', 'CREDIT_DAY_OVERDUE', 'DAYS_CREDIT_ENDDATE',
        'DAYS_ENDDATE_FACT', 'AMT_CREDIT_MAX_OVERDUE', 'CNT_CREDIT_PROLONG',
        'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT',
        'AMT_CREDIT_SUM_OVERDUE', 'CREDIT_TYPE', 'DAYS_CREDIT_UPDATE',
        'AMT_ANNUITY'],
       dtype='object'))
In [13]:
bureau_combine
Out[13]:
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY 0_prop MONTHS_BALANCE STATUS
0 380361 5715448 Active currency 1 -820 0 31069.0 NaN NaN 0 67500.0 0.0 67500.0 0.0 Credit card -183 0.0 0.296296 0 C
1 380361 5715448 Active currency 1 -820 0 31069.0 NaN NaN 0 67500.0 0.0 67500.0 0.0 Credit card -183 0.0 0.296296 -1 C
2 380361 5715448 Active currency 1 -820 0 31069.0 NaN NaN 0 67500.0 0.0 67500.0 0.0 Credit card -183 0.0 0.296296 -2 C
3 380361 5715448 Active currency 1 -820 0 31069.0 NaN NaN 0 67500.0 0.0 67500.0 0.0 Credit card -183 0.0 0.296296 -3 C
4 380361 5715448 Active currency 1 -820 0 31069.0 NaN NaN 0 67500.0 0.0 67500.0 0.0 Credit card -183 0.0 0.296296 -4 C
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
24179736 407724 5053758 Closed currency 1 -2423 0 -2150.0 -2175.0 NaN 0 21883.5 0.0 NaN 0.0 Consumer credit -787 NaN 0.000000 -75 X
24179737 407724 5053758 Closed currency 1 -2423 0 -2150.0 -2175.0 NaN 0 21883.5 0.0 NaN 0.0 Consumer credit -787 NaN 0.000000 -76 X
24179738 407724 5053758 Closed currency 1 -2423 0 -2150.0 -2175.0 NaN 0 21883.5 0.0 NaN 0.0 Consumer credit -787 NaN 0.000000 -77 X
24179739 407724 5053758 Closed currency 1 -2423 0 -2150.0 -2175.0 NaN 0 21883.5 0.0 NaN 0.0 Consumer credit -787 NaN 0.000000 -78 X
24179740 407724 5053758 Closed currency 1 -2423 0 -2150.0 -2175.0 NaN 0 21883.5 0.0 NaN 0.0 Consumer credit -787 NaN 0.000000 -79 X

24179741 rows × 20 columns

bureau data set

In [14]:
bureau_combine.columns
Out[14]:
Index(['SK_ID_CURR', 'SK_ID_BUREAU', 'CREDIT_ACTIVE', 'CREDIT_CURRENCY',
       'DAYS_CREDIT', 'CREDIT_DAY_OVERDUE', 'DAYS_CREDIT_ENDDATE',
       'DAYS_ENDDATE_FACT', 'AMT_CREDIT_MAX_OVERDUE', 'CNT_CREDIT_PROLONG',
       'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT',
       'AMT_CREDIT_SUM_OVERDUE', 'CREDIT_TYPE', 'DAYS_CREDIT_UPDATE',
       'AMT_ANNUITY', '0_prop', 'MONTHS_BALANCE', 'STATUS'],
      dtype='object')
In [15]:
# Overall proportions of 'Closed' and 'Active' in CREDIT_ACTIVE (sanity check
# before computing the per-SK_ID_CURR proportion below)
proportion_of_closed = (bureau_combine['CREDIT_ACTIVE'].values == 'Closed').sum() / len(bureau_combine['CREDIT_ACTIVE'])
proportion_of_active = (bureau_combine['CREDIT_ACTIVE'].values == 'Active').sum() / len(bureau_combine['CREDIT_ACTIVE'])

proportion_of_closed, proportion_of_active
Out[15]:
(0.7655935189711089, 0.22935282888265843)
In [16]:
bureau_combine.CREDIT_ACTIVE.value_counts()
Out[16]:
Closed      18511853
Active       5545692
Sold          122096
Bad debt         100
Name: CREDIT_ACTIVE, dtype: int64
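As a side note, the overall proportions computed in In [15] can also be read directly from `value_counts(normalize=True)`; a sketch on a toy Series:

```python
import pandas as pd

s = pd.Series(['Closed', 'Closed', 'Active', 'Sold'])
props = s.value_counts(normalize=True)   # counts divided by the total length
# props['Closed'] == 0.5, props['Active'] == 0.25, props['Sold'] == 0.25
```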
In [17]:
def status_to_numeric_actclos(x):
    # 1 if CREDIT_ACTIVE is 'Closed' or 'Active', else 0
    if x == 'Closed' or x == 'Active':
        return 1
    else:
        return 0
In [18]:
bureau_combine['clos_active_or_not'] = bureau_combine['CREDIT_ACTIVE'].apply(status_to_numeric_actclos)
bureau_combine
Out[18]:
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG ... AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY 0_prop MONTHS_BALANCE STATUS clos_active_or_not
0 380361 5715448 Active currency 1 -820 0 31069.0 NaN NaN 0 ... 0.0 67500.0 0.0 Credit card -183 0.0 0.296296 0 C 1
1 380361 5715448 Active currency 1 -820 0 31069.0 NaN NaN 0 ... 0.0 67500.0 0.0 Credit card -183 0.0 0.296296 -1 C 1
2 380361 5715448 Active currency 1 -820 0 31069.0 NaN NaN 0 ... 0.0 67500.0 0.0 Credit card -183 0.0 0.296296 -2 C 1
3 380361 5715448 Active currency 1 -820 0 31069.0 NaN NaN 0 ... 0.0 67500.0 0.0 Credit card -183 0.0 0.296296 -3 C 1
4 380361 5715448 Active currency 1 -820 0 31069.0 NaN NaN 0 ... 0.0 67500.0 0.0 Credit card -183 0.0 0.296296 -4 C 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
24179736 407724 5053758 Closed currency 1 -2423 0 -2150.0 -2175.0 NaN 0 ... 0.0 NaN 0.0 Consumer credit -787 NaN 0.000000 -75 X 1
24179737 407724 5053758 Closed currency 1 -2423 0 -2150.0 -2175.0 NaN 0 ... 0.0 NaN 0.0 Consumer credit -787 NaN 0.000000 -76 X 1
24179738 407724 5053758 Closed currency 1 -2423 0 -2150.0 -2175.0 NaN 0 ... 0.0 NaN 0.0 Consumer credit -787 NaN 0.000000 -77 X 1
24179739 407724 5053758 Closed currency 1 -2423 0 -2150.0 -2175.0 NaN 0 ... 0.0 NaN 0.0 Consumer credit -787 NaN 0.000000 -78 X 1
24179740 407724 5053758 Closed currency 1 -2423 0 -2150.0 -2175.0 NaN 0 ... 0.0 NaN 0.0 Consumer credit -787 NaN 0.000000 -79 X 1

24179741 rows × 21 columns

In [19]:
# For each SK_ID_CURR, calculate the proportion of Closed and Active in CREDIT_ACTIVE
a = bureau_combine.groupby('SK_ID_CURR')['CREDIT_ACTIVE'].count().rename('status_total').reset_index()
b = bureau_combine.groupby('SK_ID_CURR')['clos_active_or_not'].sum().rename('status_sum').reset_index()
c = pd.merge(a, b, on='SK_ID_CURR')
c['clos_active_prop'] = c['status_sum'] / c['status_total']
bureau_combine = pd.merge(c, bureau_combine, on='SK_ID_CURR')
In [20]:
bureau_combine
Out[20]:
SK_ID_CURR status_total status_sum clos_active_prop SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE ... AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY 0_prop MONTHS_BALANCE STATUS clos_active_or_not
0 100001 172 172 1.0 5896630 Closed currency 1 -857 0 -492.0 ... 0.0 0.0 0.0 Consumer credit -155 0.0 0.103448 0 C 1
1 100001 172 172 1.0 5896630 Closed currency 1 -857 0 -492.0 ... 0.0 0.0 0.0 Consumer credit -155 0.0 0.103448 -1 C 1
2 100001 172 172 1.0 5896630 Closed currency 1 -857 0 -492.0 ... 0.0 0.0 0.0 Consumer credit -155 0.0 0.103448 -2 C 1
3 100001 172 172 1.0 5896630 Closed currency 1 -857 0 -492.0 ... 0.0 0.0 0.0 Consumer credit -155 0.0 0.103448 -3 C 1
4 100001 172 172 1.0 5896630 Closed currency 1 -857 0 -492.0 ... 0.0 0.0 0.0 Consumer credit -155 0.0 0.103448 -4 C 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
24179736 456255 293 293 1.0 5126337 Closed currency 1 -2337 0 -1243.0 ... NaN NaN 0.0 Consumer credit -1621 0.0 0.028571 -72 C 1
24179737 456255 293 293 1.0 5126337 Closed currency 1 -2337 0 -1243.0 ... NaN NaN 0.0 Consumer credit -1621 0.0 0.028571 -73 C 1
24179738 456255 293 293 1.0 5126337 Closed currency 1 -2337 0 -1243.0 ... NaN NaN 0.0 Consumer credit -1621 0.0 0.028571 -74 C 1
24179739 456255 293 293 1.0 5126337 Closed currency 1 -2337 0 -1243.0 ... NaN NaN 0.0 Consumer credit -1621 0.0 0.028571 -75 0 1
24179740 456255 293 293 1.0 5126337 Closed currency 1 -2337 0 -1243.0 ... NaN NaN 0.0 Consumer credit -1621 0.0 0.028571 -76 X 1

24179741 rows × 24 columns

In [21]:
# print(bureau_combine.loc[bureau_combine['SK_ID_CURR']==456255].CREDIT_ACTIVE)                 # sanity check of the calculation
# print(bureau_combine.loc[bureau_combine['SK_ID_CURR']==456255].CREDIT_ACTIVE.value_counts())  # sanity check of the calculation
In [22]:
bureau_combine = bureau_combine.drop(['status_sum', 'status_total', 'clos_active_or_not'], axis=1)  # remove the intermediate columns
In [23]:
# Remove CREDIT_CURRENCY as it's constant.
bureau_combine = bureau_combine.drop('CREDIT_CURRENCY', axis=1)
bureau_combine.columns
Out[23]:
Index(['SK_ID_CURR', 'clos_active_prop', 'SK_ID_BUREAU', 'CREDIT_ACTIVE',
       'DAYS_CREDIT', 'CREDIT_DAY_OVERDUE', 'DAYS_CREDIT_ENDDATE',
       'DAYS_ENDDATE_FACT', 'AMT_CREDIT_MAX_OVERDUE', 'CNT_CREDIT_PROLONG',
       'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT',
       'AMT_CREDIT_SUM_OVERDUE', 'CREDIT_TYPE', 'DAYS_CREDIT_UPDATE',
       'AMT_ANNUITY', '0_prop', 'MONTHS_BALANCE', 'STATUS'],
      dtype='object')
In [24]:
# Remove DAYS_CREDIT, CREDIT_DAY_OVERDUE, DAYS_CREDIT_ENDDATE, DAYS_ENDDATE_FACT and
# DAYS_CREDIT_UPDATE as these are irrelevant variables.
bureau_combine = bureau_combine.drop(['DAYS_CREDIT', 'CREDIT_DAY_OVERDUE', 'DAYS_CREDIT_ENDDATE', 'DAYS_ENDDATE_FACT', 'DAYS_CREDIT_UPDATE'], axis=1)
bureau_combine.columns
Out[24]:
Index(['SK_ID_CURR', 'clos_active_prop', 'SK_ID_BUREAU', 'CREDIT_ACTIVE',
       'AMT_CREDIT_MAX_OVERDUE', 'CNT_CREDIT_PROLONG', 'AMT_CREDIT_SUM',
       'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT', 'AMT_CREDIT_SUM_OVERDUE',
       'CREDIT_TYPE', 'AMT_ANNUITY', '0_prop', 'MONTHS_BALANCE', 'STATUS'],
      dtype='object')
In [25]:
bureau_combine
Out[25]:
SK_ID_CURR clos_active_prop SK_ID_BUREAU CREDIT_ACTIVE AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE AMT_ANNUITY 0_prop MONTHS_BALANCE STATUS
0 100001 1.0 5896630 Closed NaN 0 112500.0 0.0 0.0 0.0 Consumer credit 0.0 0.103448 0 C
1 100001 1.0 5896630 Closed NaN 0 112500.0 0.0 0.0 0.0 Consumer credit 0.0 0.103448 -1 C
2 100001 1.0 5896630 Closed NaN 0 112500.0 0.0 0.0 0.0 Consumer credit 0.0 0.103448 -2 C
3 100001 1.0 5896630 Closed NaN 0 112500.0 0.0 0.0 0.0 Consumer credit 0.0 0.103448 -3 C
4 100001 1.0 5896630 Closed NaN 0 112500.0 0.0 0.0 0.0 Consumer credit 0.0 0.103448 -4 C
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
24179736 456255 1.0 5126337 Closed 16618.5 0 450000.0 NaN NaN 0.0 Consumer credit 0.0 0.028571 -72 C
24179737 456255 1.0 5126337 Closed 16618.5 0 450000.0 NaN NaN 0.0 Consumer credit 0.0 0.028571 -73 C
24179738 456255 1.0 5126337 Closed 16618.5 0 450000.0 NaN NaN 0.0 Consumer credit 0.0 0.028571 -74 C
24179739 456255 1.0 5126337 Closed 16618.5 0 450000.0 NaN NaN 0.0 Consumer credit 0.0 0.028571 -75 0
24179740 456255 1.0 5126337 Closed 16618.5 0 450000.0 NaN NaN 0.0 Consumer credit 0.0 0.028571 -76 X

24179741 rows × 15 columns

In [26]:
# The SK_ID_BUREAU key is no longer needed after the merge
# bureau_combine = bureau_combine.drop(['SK_ID_BUREAU'], axis=1)
In [27]:
bureau_combine.shape
Out[27]:
(24179741, 15)
In [28]:
bureau_combine.CREDIT_TYPE.value_counts()
Out[28]:
Consumer credit                                 18333667
Credit card                                      5019228
Car loan                                          446901
Mortgage                                          251213
Microloan                                          48856
Loan for business development                      42537
Another type of loan                               17271
Unknown type of loan                               13129
Loan for working capital replenishment              4902
Cash loan (non-earmarked)                           1445
Loan for the purchase of equipment                   291
Real estate loan                                     240
Loan for purchase of shares (margin lending)          35
Mobile operator loan                                  26
Name: CREDIT_TYPE, dtype: int64
In [29]:
def credit_to_numeric(x):
    # 1 if CREDIT_TYPE is 'Consumer credit' or 'Credit card', else 0
    if x == 'Consumer credit' or x == 'Credit card':
        return 1
    else:
        return 0

bureau_combine['cons_cred_card_or_not'] = bureau_combine['CREDIT_TYPE'].apply(credit_to_numeric)
bureau_combine
Out[29]:
SK_ID_CURR clos_active_prop SK_ID_BUREAU CREDIT_ACTIVE AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE AMT_ANNUITY 0_prop MONTHS_BALANCE STATUS cons_cred_card_or_not
0 100001 1.0 5896630 Closed NaN 0 112500.0 0.0 0.0 0.0 Consumer credit 0.0 0.103448 0 C 1
1 100001 1.0 5896630 Closed NaN 0 112500.0 0.0 0.0 0.0 Consumer credit 0.0 0.103448 -1 C 1
2 100001 1.0 5896630 Closed NaN 0 112500.0 0.0 0.0 0.0 Consumer credit 0.0 0.103448 -2 C 1
3 100001 1.0 5896630 Closed NaN 0 112500.0 0.0 0.0 0.0 Consumer credit 0.0 0.103448 -3 C 1
4 100001 1.0 5896630 Closed NaN 0 112500.0 0.0 0.0 0.0 Consumer credit 0.0 0.103448 -4 C 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
24179736 456255 1.0 5126337 Closed 16618.5 0 450000.0 NaN NaN 0.0 Consumer credit 0.0 0.028571 -72 C 1
24179737 456255 1.0 5126337 Closed 16618.5 0 450000.0 NaN NaN 0.0 Consumer credit 0.0 0.028571 -73 C 1
24179738 456255 1.0 5126337 Closed 16618.5 0 450000.0 NaN NaN 0.0 Consumer credit 0.0 0.028571 -74 C 1
24179739 456255 1.0 5126337 Closed 16618.5 0 450000.0 NaN NaN 0.0 Consumer credit 0.0 0.028571 -75 0 1
24179740 456255 1.0 5126337 Closed 16618.5 0 450000.0 NaN NaN 0.0 Consumer credit 0.0 0.028571 -76 X 1

24179741 rows × 16 columns

In [30]:
# For each SK_ID_CURR, calculate the proportion of Consumer credit and Credit card in CREDIT_TYPE.

a = bureau_combine.groupby('SK_ID_CURR')['CREDIT_TYPE'].count().rename('status_total').reset_index()
b = bureau_combine.groupby('SK_ID_CURR')['cons_cred_card_or_not'].sum().rename('status_sum').reset_index()
c = pd.merge(a, b, on='SK_ID_CURR')
c['cons_cred_card_prop'] = c['status_sum'] / c['status_total']
bureau_combine = pd.merge(c, bureau_combine, on='SK_ID_CURR')
bureau_combine = bureau_combine.drop(['status_sum', 'status_total', 'cons_cred_card_or_not'], axis=1)  # remove the intermediate columns
In [31]:
print(bureau_combine.loc[bureau_combine['SK_ID_CURR']==456255].cons_cred_card_prop.value_counts())  # sanity check of the calculation
print(bureau_combine.loc[bureau_combine['SK_ID_CURR']==456255].CREDIT_TYPE.value_counts())          # sanity check of the calculation
# bureau_combine
1.0    293
Name: cons_cred_card_prop, dtype: int64
Consumer credit    247
Credit card         46
Name: CREDIT_TYPE, dtype: int64
In [32]:
# Calculate the average of each numerical variable per SK_ID_CURR and use it as
# the feature value for that SK_ID_CURR. For example, the variable created from
# the proportion of 0 in STATUS is averaged for each unique SK_ID_CURR before
# being merged with application_train.
bureau_num = bureau_combine.groupby(by=['SK_ID_CURR']).mean().reset_index()     # average the numeric features per SK_ID_CURR
print(bureau_num.shape, "- shape of numeric bureau features (incl index)")
bureau_cat = pd.get_dummies(bureau_combine.select_dtypes('object'))             # one-hot encoding drops the SK_ID_CURR column ...
bureau_cat['SK_ID_CURR'] = bureau_combine['SK_ID_CURR']                         # ... so it has to be re-attached
bureau_cat = bureau_cat.groupby(by=['SK_ID_CURR']).mean().reset_index()         # mean of the dummies = per-client category proportions
print(bureau_cat.shape, "- shape of categorical bureau features (incl index)")
(134542, 13) - shape of numeric bureau features (incl index)
(134542, 27) - shape of categorical bureau features (incl index)
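The `get_dummies` + `groupby(...).mean()` step relies on the fact that the mean of a 0/1 indicator within a group is exactly that category's proportion in the group; a minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 1, 2],
    'STATUS':     ['C', 'C', 'X', '0'],
})

dummies = pd.get_dummies(df.select_dtypes('object'))   # STATUS_0, STATUS_C, STATUS_X
dummies['SK_ID_CURR'] = df['SK_ID_CURR']               # re-attach the key
props = dummies.groupby('SK_ID_CURR').mean().reset_index()
# client 1 -> STATUS_C = 2/3, STATUS_X = 1/3; client 2 -> STATUS_0 = 1.0
```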
In [33]:
bureau_cat
Out[33]:
SK_ID_CURR CREDIT_ACTIVE_Active CREDIT_ACTIVE_Bad debt CREDIT_ACTIVE_Closed CREDIT_ACTIVE_Sold CREDIT_TYPE_Another type of loan CREDIT_TYPE_Car loan CREDIT_TYPE_Cash loan (non-earmarked) CREDIT_TYPE_Consumer credit CREDIT_TYPE_Credit card ... CREDIT_TYPE_Real estate loan CREDIT_TYPE_Unknown type of loan STATUS_0 STATUS_1 STATUS_2 STATUS_3 STATUS_4 STATUS_5 STATUS_C STATUS_X
0 100001 0.186047 0.0 0.813953 0.0 0.0 0.000000 0.0 1.000000 0.000000 ... 0.0 0.0 0.180233 0.005814 0.0 0.0 0.0 0.0 0.639535 0.174419
1 100002 0.181818 0.0 0.818182 0.0 0.0 0.000000 0.0 0.472727 0.527273 ... 0.0 0.0 0.409091 0.245455 0.0 0.0 0.0 0.0 0.209091 0.136364
2 100005 0.380952 0.0 0.619048 0.0 0.0 0.000000 0.0 0.380952 0.619048 ... 0.0 0.0 0.666667 0.000000 0.0 0.0 0.0 0.0 0.238095 0.095238
3 100010 0.500000 0.0 0.500000 0.0 0.0 0.000000 0.0 0.500000 0.000000 ... 0.0 0.0 0.277778 0.000000 0.0 0.0 0.0 0.0 0.722222 0.000000
4 100013 0.000000 0.0 1.000000 0.0 0.0 0.526087 0.0 0.473913 0.000000 ... 0.0 0.0 0.343478 0.030435 0.0 0.0 0.0 0.0 0.447826 0.178261
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
134537 456247 0.146875 0.0 0.853125 0.0 0.0 0.000000 0.0 0.887500 0.037500 ... 0.0 0.0 0.206250 0.000000 0.0 0.0 0.0 0.0 0.684375 0.109375
134538 456250 0.620690 0.0 0.379310 0.0 0.0 0.000000 0.0 0.678161 0.321839 ... 0.0 0.0 0.137931 0.000000 0.0 0.0 0.0 0.0 0.287356 0.574713
134539 456253 0.470085 0.0 0.529915 0.0 0.0 0.000000 0.0 0.735043 0.264957 ... 0.0 0.0 0.401709 0.000000 0.0 0.0 0.0 0.0 0.487179 0.111111
134540 456254 0.000000 0.0 1.000000 0.0 0.0 0.000000 0.0 1.000000 0.000000 ... 0.0 0.0 0.216216 0.000000 0.0 0.0 0.0 0.0 0.783784 0.000000
134541 456255 0.317406 0.0 0.682594 0.0 0.0 0.000000 0.0 0.843003 0.156997 ... 0.0 0.0 0.215017 0.000000 0.0 0.0 0.0 0.0 0.726962 0.058020

134542 rows × 27 columns

In [34]:
bureau_num
Out[34]:
SK_ID_CURR cons_cred_card_prop clos_active_prop SK_ID_BUREAU AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE AMT_ANNUITY 0_prop MONTHS_BALANCE
0 100001 1.000000 1.0 5.896632e+06 NaN 0.000000 161516.250000 23893.770349 0.000000 0.0 1236.244186 0.180233 -16.279070
1 100002 1.000000 1.0 6.152350e+06 1312.010357 0.000000 111388.838727 70223.142857 3198.856500 0.0 0.000000 0.409091 -24.554545
2 100005 1.000000 1.0 6.735201e+06 0.000000 0.000000 175903.714286 132923.785714 0.000000 0.0 608.785714 0.666667 -4.333333
3 100010 0.500000 1.0 5.576631e+06 NaN 0.000000 495000.000000 174003.750000 0.000000 0.0 NaN 0.277778 -46.000000
4 100013 0.473913 1.0 5.922081e+06 19305.000000 0.000000 532530.923478 0.000000 NaN 0.0 0.000000 0.343478 -29.373913
... ... ... ... ... ... ... ... ... ... ... ... ... ...
134537 456247 0.925000 1.0 6.353728e+06 4612.367773 0.031250 367670.334375 417788.571429 0.000000 0.0 4837.617904 0.206250 -27.281250
134538 456250 1.000000 1.0 6.817237e+06 0.000000 0.000000 971625.227586 676051.789655 18753.043448 0.0 170124.655862 0.137931 -14.149425
134539 456253 1.000000 1.0 5.864376e+06 NaN 0.000000 914615.384615 378608.923077 0.000000 0.0 58369.500000 0.401709 -14.282051
134540 456254 1.000000 1.0 6.669849e+06 NaN 0.000000 45000.000000 0.000000 NaN 0.0 0.000000 0.216216 -18.000000
134541 456255 1.000000 1.0 5.126332e+06 13629.984961 0.112628 342536.498294 179242.109531 0.000000 0.0 1059.695565 0.215017 -26.392491

134542 rows × 13 columns

In [35]:
bureau_combine
Out[35]:
SK_ID_CURR cons_cred_card_prop clos_active_prop SK_ID_BUREAU CREDIT_ACTIVE AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE AMT_ANNUITY 0_prop MONTHS_BALANCE STATUS
0 100001 1.0 1.0 5896630 Closed NaN 0 112500.0 0.0 0.0 0.0 Consumer credit 0.0 0.103448 0 C
1 100001 1.0 1.0 5896630 Closed NaN 0 112500.0 0.0 0.0 0.0 Consumer credit 0.0 0.103448 -1 C
2 100001 1.0 1.0 5896630 Closed NaN 0 112500.0 0.0 0.0 0.0 Consumer credit 0.0 0.103448 -2 C
3 100001 1.0 1.0 5896630 Closed NaN 0 112500.0 0.0 0.0 0.0 Consumer credit 0.0 0.103448 -3 C
4 100001 1.0 1.0 5896630 Closed NaN 0 112500.0 0.0 0.0 0.0 Consumer credit 0.0 0.103448 -4 C
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
24179736 456255 1.0 1.0 5126337 Closed 16618.5 0 450000.0 NaN NaN 0.0 Consumer credit 0.0 0.028571 -72 C
24179737 456255 1.0 1.0 5126337 Closed 16618.5 0 450000.0 NaN NaN 0.0 Consumer credit 0.0 0.028571 -73 C
24179738 456255 1.0 1.0 5126337 Closed 16618.5 0 450000.0 NaN NaN 0.0 Consumer credit 0.0 0.028571 -74 C
24179739 456255 1.0 1.0 5126337 Closed 16618.5 0 450000.0 NaN NaN 0.0 Consumer credit 0.0 0.028571 -75 0
24179740 456255 1.0 1.0 5126337 Closed 16618.5 0 450000.0 NaN NaN 0.0 Consumer credit 0.0 0.028571 -76 X

24179741 rows × 16 columns

In [36]:
bureau_combine_d = bureau_combine.drop(['CREDIT_ACTIVE', 'CREDIT_TYPE', 'STATUS'], axis=1)  # remove the raw categorical columns
bureau_combine_d = bureau_combine_d.groupby(by=['SK_ID_CURR']).mean().reset_index()         # average the numeric features per SK_ID_CURR
bureau_combine_d  # NaNs in the averages are left for the missing-value handling step
Out[36]:
SK_ID_CURR cons_cred_card_prop clos_active_prop SK_ID_BUREAU AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE AMT_ANNUITY 0_prop MONTHS_BALANCE
0 100001 1.000000 1.0 5.896632e+06 NaN 0.000000 161516.250000 23893.770349 0.000000 0.0 1236.244186 0.180233 -16.279070
1 100002 1.000000 1.0 6.152350e+06 1312.010357 0.000000 111388.838727 70223.142857 3198.856500 0.0 0.000000 0.409091 -24.554545
2 100005 1.000000 1.0 6.735201e+06 0.000000 0.000000 175903.714286 132923.785714 0.000000 0.0 608.785714 0.666667 -4.333333
3 100010 0.500000 1.0 5.576631e+06 NaN 0.000000 495000.000000 174003.750000 0.000000 0.0 NaN 0.277778 -46.000000
4 100013 0.473913 1.0 5.922081e+06 19305.000000 0.000000 532530.923478 0.000000 NaN 0.0 0.000000 0.343478 -29.373913
... ... ... ... ... ... ... ... ... ... ... ... ... ...
134537 456247 0.925000 1.0 6.353728e+06 4612.367773 0.031250 367670.334375 417788.571429 0.000000 0.0 4837.617904 0.206250 -27.281250
134538 456250 1.000000 1.0 6.817237e+06 0.000000 0.000000 971625.227586 676051.789655 18753.043448 0.0 170124.655862 0.137931 -14.149425
134539 456253 1.000000 1.0 5.864376e+06 NaN 0.000000 914615.384615 378608.923077 0.000000 0.0 58369.500000 0.401709 -14.282051
134540 456254 1.000000 1.0 6.669849e+06 NaN 0.000000 45000.000000 0.000000 NaN 0.0 0.000000 0.216216 -18.000000
134541 456255 1.000000 1.0 5.126332e+06 13629.984961 0.112628 342536.498294 179242.109531 0.000000 0.0 1059.695565 0.215017 -26.392491

134542 rows × 13 columns

In [37]:
len(bureau_combine_d.SK_ID_CURR.unique())
Out[37]:
134542
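Since each aggregated frame now has one row per SK_ID_CURR, the eventual merge into application_train can enforce key uniqueness explicitly; a sketch with hypothetical toy frames standing in for the real tables:

```python
import pandas as pd

left = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})                    # stands in for application_train
right = pd.DataFrame({'SK_ID_CURR': [1, 2], 'feat': [0.1, 0.2]})  # stands in for an aggregate

# how='left' keeps clients with no bureau history (feat becomes NaN);
# validate='one_to_one' raises MergeError if either key contains duplicates
merged = left.merge(right, on='SK_ID_CURR', how='left', validate='one_to_one')
```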
In [38]:
### YOUR CODE HERE ###

installments_payments dataset

In [39]:
installments_payments
Out[39]:
SK_ID_PREV SK_ID_CURR NUM_INSTALMENT_VERSION NUM_INSTALMENT_NUMBER DAYS_INSTALMENT DAYS_ENTRY_PAYMENT AMT_INSTALMENT AMT_PAYMENT
0 1054186 161674 1.0 6 -1180.0 -1187.0 6948.360 6948.360
1 1330831 151639 0.0 34 -2156.0 -2156.0 1716.525 1716.525
2 2085231 193053 2.0 1 -63.0 -63.0 25425.000 25425.000
3 2452527 199697 1.0 3 -2418.0 -2426.0 24350.130 24350.130
4 2714724 167756 1.0 2 -1383.0 -1366.0 2165.040 2160.585
... ... ... ... ... ... ... ... ...
13605396 2186857 428057 0.0 66 -1624.0 NaN 67.500 NaN
13605397 1310347 414406 0.0 47 -1539.0 NaN 67.500 NaN
13605398 1308766 402199 0.0 43 -7.0 NaN 43737.435 NaN
13605399 1062206 409297 0.0 43 -1986.0 NaN 67.500 NaN
13605400 2448869 434321 1.0 19 -27.0 NaN 11504.250 NaN

13605401 rows × 8 columns

In [40]:
# Combine AMT_INSTALMENT and AMT_PAYMENT into one dummy variable indicating whether
# the client has paid out the installment on time: if AMT_PAYMENT >= AMT_INSTALMENT
# the dummy equals 1, otherwise it equals 0.

paydiff = installments_payments['AMT_PAYMENT'] - installments_payments['AMT_INSTALMENT']
installments_payments['paid_on_time'] = paydiff.apply(lambda x: 1 if x >= 0 else 0)
In [41]:
print('percent of paid on time', installments_payments['paid_on_time'].sum() / len(installments_payments['paid_on_time']))
print('maximum paydiff', paydiff.max(skipna=True))

# Drop AMT_INSTALMENT and AMT_PAYMENT now that the dummy variable encodes them
installments_payments_new = installments_payments.drop(['AMT_PAYMENT', 'AMT_INSTALMENT'], axis=1)
percent of paid on time 0.9045674581734121
maximum paydiff 2630908.935
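Worth noting: AMT_PAYMENT is NaN for some rows (see the tail of Out[39]), and a NaN comparison evaluates to False, so the lambda above marks those installments as not paid on time. A vectorized sketch of the same dummy on toy data:

```python
import numpy as np
import pandas as pd

ip = pd.DataFrame({
    'AMT_INSTALMENT': [100.0, 100.0, 100.0],
    'AMT_PAYMENT':    [100.0,  50.0, np.nan],
})

# NaN >= x evaluates to False, so missing payments become 0,
# matching the apply/lambda approach above
ip['paid_on_time'] = (ip['AMT_PAYMENT'] >= ip['AMT_INSTALMENT']).astype(int)
# -> paid_on_time == [1, 0, 0]
```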
In [42]:
# Then, calculate the maximum of that dummy variable for each SK_ID_CURR.
maxima = installments_payments_new.groupby('SK_ID_CURR')['paid_on_time'].max()

installments_payments_new['max_paid_on_time']=installments_payments_new['SK_ID_CURR'].map(maxima)
installments_payments_new
Out[42]:
SK_ID_PREV SK_ID_CURR NUM_INSTALMENT_VERSION NUM_INSTALMENT_NUMBER DAYS_INSTALMENT DAYS_ENTRY_PAYMENT paid_on_time max_paid_on_time
0 1054186 161674 1.0 6 -1180.0 -1187.0 1 1
1 1330831 151639 0.0 34 -2156.0 -2156.0 1 1
2 2085231 193053 2.0 1 -63.0 -63.0 1 1
3 2452527 199697 1.0 3 -2418.0 -2426.0 1 1
4 2714724 167756 1.0 2 -1383.0 -1366.0 0 1
... ... ... ... ... ... ... ... ...
13605396 2186857 428057 0.0 66 -1624.0 NaN 0 1
13605397 1310347 414406 0.0 47 -1539.0 NaN 0 1
13605398 1308766 402199 0.0 43 -7.0 NaN 0 1
13605399 1062206 409297 0.0 43 -1986.0 NaN 0 1
13605400 2448869 434321 1.0 19 -27.0 NaN 0 1

13605401 rows × 8 columns

In [43]:
installments_payments_u = installments_payments_new.groupby(by=['SK_ID_CURR']).mean().reset_index()
len(installments_payments_u.SK_ID_CURR.unique())  # one row per SK_ID_CURR after the aggregation
Out[43]:
339587
In [44]:
len(installments_payments_new.SK_ID_CURR.unique())  # distinct clients in the raw table (rows repeat per instalment)
Out[44]:
339587

credit_card_balance dataset

In [45]:
credit_card_balance
Out[45]:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE AMT_BALANCE AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY ... AMT_RECIVABLE AMT_TOTAL_RECEIVABLE CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT CNT_INSTALMENT_MATURE_CUM NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 2562384 378907 -6 56.970 135000 0.0 877.5 0.0 877.5 1700.325 ... 0.000 0.000 0.0 1 0.0 1.0 35.0 Active 0 0
1 2582071 363914 -1 63975.555 45000 2250.0 2250.0 0.0 0.0 2250.000 ... 64875.555 64875.555 1.0 1 0.0 0.0 69.0 Active 0 0
2 1740877 371185 -7 31815.225 450000 0.0 0.0 0.0 0.0 2250.000 ... 31460.085 31460.085 0.0 0 0.0 0.0 30.0 Active 0 0
3 1389973 337855 -4 236572.110 225000 2250.0 2250.0 0.0 0.0 11795.760 ... 233048.970 233048.970 1.0 1 0.0 0.0 10.0 Active 0 0
4 1891521 126868 -1 453919.455 450000 0.0 11547.0 0.0 11547.0 22924.890 ... 453919.455 453919.455 0.0 1 0.0 1.0 101.0 Active 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3840307 1036507 328243 -9 0.000 45000 NaN 0.0 NaN NaN 0.000 ... 0.000 0.000 NaN 0 NaN NaN 0.0 Active 0 0
3840308 1714892 347207 -9 0.000 45000 0.0 0.0 0.0 0.0 0.000 ... 0.000 0.000 0.0 0 0.0 0.0 23.0 Active 0 0
3840309 1302323 215757 -9 275784.975 585000 270000.0 270000.0 0.0 0.0 2250.000 ... 273093.975 273093.975 2.0 2 0.0 0.0 18.0 Active 0 0
3840310 1624872 430337 -10 0.000 450000 NaN 0.0 NaN NaN 0.000 ... 0.000 0.000 NaN 0 NaN NaN 0.0 Active 0 0
3840311 2411345 236760 -10 0.000 157500 0.0 0.0 0.0 0.0 0.000 ... 0.000 0.000 0.0 0 0.0 0.0 21.0 Completed 0 0

3840312 rows × 23 columns

In [46]:
# For the variables AMT_DRAWINGS_ATM_CURRENT, AMT_DRAWINGS_CURRENT, AMT_DRAWINGS_OTHER_CURRENT,
# AMT_DRAWINGS_POS_CURRENT, AMT_RECIVABLE, and AMT_TOTAL_RECEIVABLE replace all NA with 0 
cols= ['AMT_DRAWINGS_ATM_CURRENT', 'AMT_DRAWINGS_CURRENT','AMT_DRAWINGS_OTHER_CURRENT','AMT_DRAWINGS_POS_CURRENT','AMT_RECIVABLE','AMT_TOTAL_RECEIVABLE']
credit_card_balance[cols]=credit_card_balance[cols].fillna(0)
credit_card_balance 
Out[46]:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE AMT_BALANCE AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY ... AMT_RECIVABLE AMT_TOTAL_RECEIVABLE CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT CNT_INSTALMENT_MATURE_CUM NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 2562384 378907 -6 56.970 135000 0.0 877.5 0.0 877.5 1700.325 ... 0.000 0.000 0.0 1 0.0 1.0 35.0 Active 0 0
1 2582071 363914 -1 63975.555 45000 2250.0 2250.0 0.0 0.0 2250.000 ... 64875.555 64875.555 1.0 1 0.0 0.0 69.0 Active 0 0
2 1740877 371185 -7 31815.225 450000 0.0 0.0 0.0 0.0 2250.000 ... 31460.085 31460.085 0.0 0 0.0 0.0 30.0 Active 0 0
3 1389973 337855 -4 236572.110 225000 2250.0 2250.0 0.0 0.0 11795.760 ... 233048.970 233048.970 1.0 1 0.0 0.0 10.0 Active 0 0
4 1891521 126868 -1 453919.455 450000 0.0 11547.0 0.0 11547.0 22924.890 ... 453919.455 453919.455 0.0 1 0.0 1.0 101.0 Active 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3840307 1036507 328243 -9 0.000 45000 0.0 0.0 0.0 0.0 0.000 ... 0.000 0.000 NaN 0 NaN NaN 0.0 Active 0 0
3840308 1714892 347207 -9 0.000 45000 0.0 0.0 0.0 0.0 0.000 ... 0.000 0.000 0.0 0 0.0 0.0 23.0 Active 0 0
3840309 1302323 215757 -9 275784.975 585000 270000.0 270000.0 0.0 0.0 2250.000 ... 273093.975 273093.975 2.0 2 0.0 0.0 18.0 Active 0 0
3840310 1624872 430337 -10 0.000 450000 0.0 0.0 0.0 0.0 0.000 ... 0.000 0.000 NaN 0 NaN NaN 0.0 Active 0 0
3840311 2411345 236760 -10 0.000 157500 0.0 0.0 0.0 0.0 0.000 ... 0.000 0.000 0.0 0 0.0 0.0 21.0 Completed 0 0

3840312 rows × 23 columns

In [47]:
len(credit_card_balance.SK_ID_CURR.unique())
Out[47]:
103558
In [48]:
# then compute average for each SK_ID_CURR.
ccb_num = credit_card_balance.groupby(by=['SK_ID_CURR']).mean().reset_index()            # group the numeric features by SK_ID_CURR
ccb_num
Out[48]:
SK_ID_CURR SK_ID_PREV MONTHS_BALANCE AMT_BALANCE AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY ... AMT_RECEIVABLE_PRINCIPAL AMT_RECIVABLE AMT_TOTAL_RECEIVABLE CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT CNT_INSTALMENT_MATURE_CUM SK_DPD SK_DPD_DEF
0 100006 1489396.0 -3.5 0.000000 270000.000000 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.000000 0.000000 NaN 0.000000 NaN NaN 0.000000 0.000000 0.000000
1 100011 1843384.0 -38.5 54482.111149 164189.189189 2432.432432 2432.432432 0.0 0.000000 3956.221849 ... 52402.088919 54433.179122 54433.179122 0.054054 0.054054 0.0 0.000000 25.767123 0.000000 0.000000
2 100013 2038692.0 -48.5 18159.919219 131718.750000 5953.125000 5953.125000 0.0 0.000000 1454.539551 ... 17255.559844 18101.079844 18101.079844 0.255556 0.239583 0.0 0.000000 18.719101 0.010417 0.010417
3 100021 2594025.0 -10.0 0.000000 675000.000000 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.000000 0.000000 NaN 0.000000 NaN NaN 0.000000 0.000000 0.000000
4 100023 1499902.0 -7.5 0.000000 135000.000000 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.000000 0.000000 NaN 0.000000 NaN NaN 0.000000 0.000000 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
103553 456244 2181926.0 -21.0 131834.730732 296341.463415 24475.609756 26842.388049 0.0 2363.015854 6514.200000 ... 127608.373537 130767.060732 130767.060732 1.048780 1.365854 0.0 0.317073 13.600000 0.000000 0.000000
103554 456246 1079732.0 -5.5 13136.731875 135000.000000 0.000000 15199.256250 0.0 15199.256250 1439.150625 ... 12883.016250 12897.894375 12897.894375 0.000000 2.500000 0.0 2.500000 3.500000 0.000000 0.000000
103555 456247 1595171.0 -49.0 23216.396211 144000.000000 2136.315789 2149.506474 0.0 13.190684 1414.704789 ... 22100.653895 23128.243105 23128.243105 0.115789 0.147368 0.0 0.031579 26.494737 0.031579 0.021053
103556 456248 2743495.0 -13.0 0.000000 900000.000000 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.000000 0.000000 NaN 0.000000 NaN NaN 0.000000 0.000000 0.000000
103557 456250 1794451.0 -6.5 173589.326250 178875.000000 15000.000000 15000.000000 0.0 0.000000 7540.080000 ... 168446.591250 172237.526250 172237.526250 0.666667 0.666667 0.0 0.000000 4.583333 0.000000 0.000000

103558 rows × 22 columns

In [49]:
credit_card_balance['NAME_CONTRACT_STATUS'].value_counts()
Out[49]:
Active           3698436
Completed         128918
Signed             11058
Demand              1365
Sent proposal        513
Refused               17
Approved               5
Name: NAME_CONTRACT_STATUS, dtype: int64
In [50]:
# One-hot encode the categorical column(s) so they can be aggregated per client
ccb_cat = pd.get_dummies(credit_card_balance.select_dtypes('object'))   # note: this drops the (numeric) SK_ID_CURR column ...
ccb_cat
Out[50]:
NAME_CONTRACT_STATUS_Active NAME_CONTRACT_STATUS_Approved NAME_CONTRACT_STATUS_Completed NAME_CONTRACT_STATUS_Demand NAME_CONTRACT_STATUS_Refused NAME_CONTRACT_STATUS_Sent proposal NAME_CONTRACT_STATUS_Signed
0 1 0 0 0 0 0 0
1 1 0 0 0 0 0 0
2 1 0 0 0 0 0 0
3 1 0 0 0 0 0 0
4 1 0 0 0 0 0 0
... ... ... ... ... ... ... ...
3840307 1 0 0 0 0 0 0
3840308 1 0 0 0 0 0 0
3840309 1 0 0 0 0 0 0
3840310 1 0 0 0 0 0 0
3840311 0 0 1 0 0 0 0

3840312 rows × 7 columns

In [51]:
ccb_cat['SK_ID_CURR'] = credit_card_balance['SK_ID_CURR']                # so we have to add it back
ccb_cat = ccb_cat.groupby(by = ['SK_ID_CURR']).mean().reset_index()      # could also try sum here
ccb_cat
Out[51]:
SK_ID_CURR NAME_CONTRACT_STATUS_Active NAME_CONTRACT_STATUS_Approved NAME_CONTRACT_STATUS_Completed NAME_CONTRACT_STATUS_Demand NAME_CONTRACT_STATUS_Refused NAME_CONTRACT_STATUS_Sent proposal NAME_CONTRACT_STATUS_Signed
0 100006 1.000000 0.0 0.000000 0.0 0.0 0.0 0.0
1 100011 1.000000 0.0 0.000000 0.0 0.0 0.0 0.0
2 100013 1.000000 0.0 0.000000 0.0 0.0 0.0 0.0
3 100021 0.411765 0.0 0.588235 0.0 0.0 0.0 0.0
4 100023 1.000000 0.0 0.000000 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ...
103553 456244 0.878049 0.0 0.121951 0.0 0.0 0.0 0.0
103554 456246 1.000000 0.0 0.000000 0.0 0.0 0.0 0.0
103555 456247 1.000000 0.0 0.000000 0.0 0.0 0.0 0.0
103556 456248 1.000000 0.0 0.000000 0.0 0.0 0.0 0.0
103557 456250 1.000000 0.0 0.000000 0.0 0.0 0.0 0.0

103558 rows × 8 columns
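Averaging one-hot dummies per client, as above, yields the *proportion* of each contract status among that client's records (e.g. 0.588 of client 100021's records are Completed). A small sketch with made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'SK_ID_CURR': [1, 1, 1, 2],
                    'NAME_CONTRACT_STATUS': ['Active', 'Active', 'Completed', 'Active']})
dummies = pd.get_dummies(toy[['NAME_CONTRACT_STATUS']])
dummies['SK_ID_CURR'] = toy['SK_ID_CURR']
# mean of 0/1 dummies per group == proportion of each status per client
props = dummies.groupby('SK_ID_CURR').mean().reset_index()
print(props)
```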

In [52]:
credit_card_balance_u=pd.merge(ccb_num,ccb_cat,on='SK_ID_CURR')
In [53]:
len(credit_card_balance_u)==len(credit_card_balance_u.SK_ID_CURR.unique())
Out[53]:
True
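Since both aggregated frames now have one row per SK_ID_CURR, the merge can be made to fail loudly if that assumption is ever violated by passing `validate='one_to_one'` to `pd.merge`. A sketch with invented columns:

```python
import pandas as pd

left = pd.DataFrame({'SK_ID_CURR': [1, 2], 'mean_balance': [10.0, 20.0]})
right = pd.DataFrame({'SK_ID_CURR': [1, 2], 'active_prop': [1.0, 0.5]})
# Raises MergeError if either side has duplicate SK_ID_CURR values
merged = pd.merge(left, right, on='SK_ID_CURR', validate='one_to_one')
print(merged.shape)  # (2, 3)
```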

POS_CASH_balance

In [54]:
POS_CASH_balance
Out[54]:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 1803195 182943 -31 48.0 45.0 Active 0 0
1 1715348 367990 -33 36.0 35.0 Active 0 0
2 1784872 397406 -32 12.0 9.0 Active 0 0
3 1903291 269225 -35 48.0 42.0 Active 0 0
4 2341044 334279 -35 36.0 35.0 Active 0 0
... ... ... ... ... ... ... ... ...
10001353 2448283 226558 -20 6.0 0.0 Active 843 0
10001354 1717234 141565 -19 12.0 0.0 Active 602 0
10001355 1283126 315695 -21 10.0 0.0 Active 609 0
10001356 1082516 450255 -22 12.0 0.0 Active 614 0
10001357 1259607 174278 -52 16.0 0.0 Completed 0 0

10001358 rows × 8 columns

In [55]:
# Pick variable SK_DPD, replace NA with 0 if any, and make it into a dummy variable based on the
# following criterion: If the original value is 0 then keep it as 0, else replace original value with 1.
POS_CASH_balance['SK_DPD']=POS_CASH_balance['SK_DPD'].fillna(0)
POS_CASH_balance['SK_DPD'] = POS_CASH_balance['SK_DPD'].apply(lambda x: 1 if x>0 else 0)
In [56]:
POS_CASH_balance
Out[56]:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 1803195 182943 -31 48.0 45.0 Active 0 0
1 1715348 367990 -33 36.0 35.0 Active 0 0
2 1784872 397406 -32 12.0 9.0 Active 0 0
3 1903291 269225 -35 48.0 42.0 Active 0 0
4 2341044 334279 -35 36.0 35.0 Active 0 0
... ... ... ... ... ... ... ... ...
10001353 2448283 226558 -20 6.0 0.0 Active 1 0
10001354 1717234 141565 -19 12.0 0.0 Active 1 0
10001355 1283126 315695 -21 10.0 0.0 Active 1 0
10001356 1082516 450255 -22 12.0 0.0 Active 1 0
10001357 1259607 174278 -52 16.0 0.0 Completed 0 0

10001358 rows × 8 columns

In [57]:
# Then, take the maximum of the newly created dummy variable for each SK_ID_CURR.
maxima = POS_CASH_balance.groupby('SK_ID_CURR')['SK_DPD'].max()

POS_CASH_balance['max_SK_DPD']=POS_CASH_balance['SK_ID_CURR'].map(maxima)
POS_CASH_balance
Out[57]:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF max_SK_DPD
0 1803195 182943 -31 48.0 45.0 Active 0 0 0
1 1715348 367990 -33 36.0 35.0 Active 0 0 0
2 1784872 397406 -32 12.0 9.0 Active 0 0 1
3 1903291 269225 -35 48.0 42.0 Active 0 0 0
4 2341044 334279 -35 36.0 35.0 Active 0 0 0
... ... ... ... ... ... ... ... ... ...
10001353 2448283 226558 -20 6.0 0.0 Active 1 0 1
10001354 1717234 141565 -19 12.0 0.0 Active 1 0 1
10001355 1283126 315695 -21 10.0 0.0 Active 1 0 1
10001356 1082516 450255 -22 12.0 0.0 Active 1 0 1
10001357 1259607 174278 -52 16.0 0.0 Completed 0 0 0

10001358 rows × 9 columns

In [58]:
# Replace NA with 0 for CNT_INSTALMENT_FUTURE, and take its average for each SK_ID_CURR.
POS_CASH_balance['CNT_INSTALMENT_FUTURE']=POS_CASH_balance['CNT_INSTALMENT_FUTURE'].fillna(0)
POS_CASH_balance
Out[58]:
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF max_SK_DPD
0 1803195 182943 -31 48.0 45.0 Active 0 0 0
1 1715348 367990 -33 36.0 35.0 Active 0 0 0
2 1784872 397406 -32 12.0 9.0 Active 0 0 1
3 1903291 269225 -35 48.0 42.0 Active 0 0 0
4 2341044 334279 -35 36.0 35.0 Active 0 0 0
... ... ... ... ... ... ... ... ... ...
10001353 2448283 226558 -20 6.0 0.0 Active 1 0 1
10001354 1717234 141565 -19 12.0 0.0 Active 1 0 1
10001355 1283126 315695 -21 10.0 0.0 Active 1 0 1
10001356 1082516 450255 -22 12.0 0.0 Active 1 0 1
10001357 1259607 174278 -52 16.0 0.0 Completed 0 0 0

10001358 rows × 9 columns

In [59]:
POS_CASH_balance_u=POS_CASH_balance.groupby(by=['SK_ID_CURR']).mean().reset_index()  
In [60]:
len(POS_CASH_balance_u)==len(POS_CASH_balance.SK_ID_CURR.unique())
Out[60]:
True
In [ ]:
 

result report

Report the shape of the combined dataset.

In [61]:
application_train
Out[61]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
307506 456251 0 Cash loans M N N 0 157500.0 254700.0 27558.0 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
307507 456252 0 Cash loans F N Y 0 72000.0 269550.0 12001.5 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
307508 456253 0 Cash loans F N Y 0 153000.0 677664.0 29979.0 ... 0 0 0 0 1.0 0.0 0.0 1.0 0.0 1.0
307509 456254 1 Cash loans F N Y 0 171000.0 370107.0 20205.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
307510 456255 0 Cash loans F N N 0 157500.0 675000.0 49117.5 ... 0 0 0 0 0.0 0.0 0.0 2.0 0.0 1.0

307511 rows × 122 columns

In [62]:
### YOUR ANSWER HERE ###
## combine these datasets: bureau_combine_d, installments_payments_u, POS_CASH_balance_u, credit_card_balance_u
application_train_m1=pd.merge(application_train,bureau_combine_d,on='SK_ID_CURR')
application_train_m2=pd.merge(application_train_m1,installments_payments_u,on='SK_ID_CURR')
application_train_m3=pd.merge(application_train_m2,POS_CASH_balance_u,on='SK_ID_CURR')
application_train_m=pd.merge(application_train_m3,credit_card_balance_u,on='SK_ID_CURR')
In [65]:
print('the shape of the combined dataset is:',application_train_m.shape)
application_train_m.to_csv('/home/tianyu/code/project1/data/home-credit-default-risk/application_m.csv')
application_train_m
the shape of the combined dataset is: (28981, 176)
Out[65]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY_x ... CNT_INSTALMENT_MATURE_CUM SK_DPD_y SK_DPD_DEF_y NAME_CONTRACT_STATUS_Active NAME_CONTRACT_STATUS_Approved NAME_CONTRACT_STATUS_Completed NAME_CONTRACT_STATUS_Demand NAME_CONTRACT_STATUS_Refused NAME_CONTRACT_STATUS_Sent proposal NAME_CONTRACT_STATUS_Signed
0 100043 0 Cash loans F N Y 2 198000.0 641173.5 23157.0 ... 13.870968 0.000000 0.000000 1.000000 0.0 0.000000 0.0 0.0 0.0 0.000000
1 100048 0 Cash loans F N Y 0 202500.0 604152.0 29196.0 ... 12.344828 0.000000 0.000000 1.000000 0.0 0.000000 0.0 0.0 0.0 0.000000
2 100050 0 Cash loans F N Y 0 108000.0 746280.0 42970.5 ... 0.000000 0.000000 0.000000 1.000000 0.0 0.000000 0.0 0.0 0.0 0.000000
3 100059 0 Cash loans M Y Y 1 540000.0 675000.0 34596.0 ... 0.000000 0.000000 0.000000 0.600000 0.0 0.400000 0.0 0.0 0.0 0.000000
4 100100 0 Cash loans M Y Y 2 202500.0 796396.5 38443.5 ... 38.774194 0.127660 0.127660 0.968085 0.0 0.000000 0.0 0.0 0.0 0.031915
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
28976 456233 1 Cash loans F N Y 0 225000.0 521280.0 23089.5 ... 2.800000 0.000000 0.000000 1.000000 0.0 0.000000 0.0 0.0 0.0 0.000000
28977 456235 0 Cash loans M Y Y 2 90000.0 1078200.0 31522.5 ... 0.000000 0.000000 0.000000 1.000000 0.0 0.000000 0.0 0.0 0.0 0.000000
28978 456242 0 Cash loans M Y Y 0 198000.0 1312110.0 52168.5 ... 3.500000 0.000000 0.000000 1.000000 0.0 0.000000 0.0 0.0 0.0 0.000000
28979 456244 0 Cash loans F N Y 0 261000.0 1303812.0 35982.0 ... 13.600000 0.000000 0.000000 0.878049 0.0 0.121951 0.0 0.0 0.0 0.000000
28980 456247 0 Cash loans F N Y 0 112500.0 345510.0 17770.5 ... 26.494737 0.031579 0.021053 1.000000 0.0 0.000000 0.0 0.0 0.0 0.000000

28981 rows × 176 columns
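The chain of four merges above can be condensed with `functools.reduce`; note that the default inner join keeps only clients present in *every* table, which is why the row count drops from 307,511 to 28,981. A toy sketch:

```python
from functools import reduce
import pandas as pd

frames = [
    pd.DataFrame({'SK_ID_CURR': [1, 2, 3], 'a': [0, 1, 2]}),
    pd.DataFrame({'SK_ID_CURR': [1, 2],    'b': [5, 6]}),
    pd.DataFrame({'SK_ID_CURR': [2, 3],    'c': [7, 8]}),
]
# successive inner joins keep only ids present in every frame
combined = reduce(lambda l, r: pd.merge(l, r, on='SK_ID_CURR'), frames)
print(combined.shape)  # only SK_ID_CURR=2 survives all joins -> (1, 4)
```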

In [ ]:
 

Provide a data dictionary (i.e. an organized list of variables you have added to the main dataset and their descriptions) before performing any statistical modelling.

In [66]:
### YOUR ANSWER HERE ###
print('columns of application train data',application_train_m.columns)
columns of application train data Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY_x',
       ...
       'CNT_INSTALMENT_MATURE_CUM', 'SK_DPD_y', 'SK_DPD_DEF_y',
       'NAME_CONTRACT_STATUS_Active', 'NAME_CONTRACT_STATUS_Approved',
       'NAME_CONTRACT_STATUS_Completed', 'NAME_CONTRACT_STATUS_Demand',
       'NAME_CONTRACT_STATUS_Refused', 'NAME_CONTRACT_STATUS_Sent proposal',
       'NAME_CONTRACT_STATUS_Signed'],
      dtype='object', length=176)
In [160]:
value_description = {
  "clos_active_prop": "proportion of 0 in the variable STATUS; an indication of good status of the credit bureau loan",
  "closd_active_prop": "proportion of Closed and Active in CREDIT_ACTIVE",
  "cons_cred_card_prop": "proportion of Consumer credit and Credit card in CREDIT_TYPE",
  "paid_on_time": "dummy variable: 1 if AMT_PAYMENT >= AMT_INSTALMENT, otherwise 0",
  "SK_DPD": "dummy variable: 0 if the original SK_DPD value is 0, otherwise 1"
}
for key, value in value_description.items():
    print('variable:', key)
    print('description:', value)
variable: clos_active_prop
description: proportion of 0 in the variable STATUS; an indication of good status of the credit bureau loan
variable: closd_active_prop
description: proportion of Closed and Active in CREDIT_ACTIVE
variable: cons_cred_card_prop
description: proportion of Consumer credit and Credit card in CREDIT_TYPE
variable: paid_on_time
description: dummy variable: 1 if AMT_PAYMENT >= AMT_INSTALMENT, otherwise 0
variable: SK_DPD
description: dummy variable: 0 if the original SK_DPD value is 0, otherwise 1
  • 1.2 Visualizing the response variable [5 Marks]

Determine the response variable in your notebook and explain in plain English what it represents. Provide a Seaborn plot for the response variable and one arbitrary pair of the features that you have created in the previous parts. Use appropriate figure size, title/caption, legend, and axis titles.

In [65]:
### YOUR CODE HERE ###
#Look into the function and corresponding parameters, sns.relplot;
# Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases)
import cufflinks as cf
cf.go_offline()
cf.set_config_file(theme='polar')
In [66]:
application_train['TARGET'].value_counts()
Out[66]:
0    282686
1     24825
Name: TARGET, dtype: int64
In [67]:
target_val = application_train['TARGET'].value_counts()
target_df = pd.DataFrame({'labels': target_val.index,
                   'values': target_val.values
                  })
target_df.iplot(kind='pie',labels='labels',values='values', title='Types of target', hole = 0.6)

Explanation: TARGET indicates whether the client had payment difficulties: 0 means the loan was repaid, 1 means it was not. The data is imbalanced (91.93% repaid (0) vs. 8.07% not repaid (1)), so this imbalance needs to be handled.
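The quoted percentages follow directly from the value counts above:

```python
counts = {0: 282686, 1: 24825}          # TARGET value counts from application_train
total = sum(counts.values())            # 307511 applications
share = {k: round(100 * v / total, 2) for k, v in counts.items()}
print(share)  # {0: 91.93, 1: 8.07}
```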

In [68]:
contract_val = application_train['NAME_CONTRACT_TYPE'].value_counts()
contract_df = pd.DataFrame({'labels': contract_val.index,
                   'values': contract_val.values
                  })
contract_df.iplot(kind='pie',labels='labels',values='values', title='Types of Loan', hole = 0.6)

Explanation: Many more applicants take cash loans than revolving loans.

Distribution of AMT_INCOME_TOTAL.
In [69]:
application_train[application_train['AMT_INCOME_TOTAL'] < 2000000]['AMT_INCOME_TOTAL'].iplot(kind='histogram', bins=100,
   xTitle = 'Total Income', yTitle ='Count of applicants',
             title='Distribution of AMT_INCOME_TOTAL')

Explanation:

  1. The distribution is right-skewed with extreme values, so a log transformation can be applied.
  2. People with high income (>1,000,000) are likely to repay the loan.

Part 2: Model Construction using Logistic Regression [10 Marks]

  • 2.1 Constructing the model [5 Marks]

Fit a logistic regression model using the selected explanatory variables and the training data.

In [72]:
# Split the combined dataset into a training (70%) and testing set (30%)

y = application_train_m.pop('TARGET').values
X_train, X_temp, y_train, y_temp = train_test_split(application_train_m.drop(['SK_ID_CURR'],axis=1), y, stratify = y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, stratify = y_temp, test_size=0.5, random_state=42)
print('Shape of X_train:',X_train.shape)
print('Shape of X_val:',X_val.shape)
print('Shape of X_test:',X_test.shape)
Shape of X_train: (20286, 174)
Shape of X_val: (4347, 174)
Shape of X_test: (4348, 174)
In [ ]:
 
In [203]:
# Separation of columns into numeric and categorical columns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

types = np.array([dt for dt in X_train.dtypes])
all_columns = X_train.columns.values
is_num = types != 'object'
num_cols = all_columns[is_num]
cat_cols = all_columns[~is_num]
# Featurization of numeric data
imputer_num = SimpleImputer(strategy='median')
X_train_num = imputer_num.fit_transform(X_train[num_cols])
X_val_num = imputer_num.transform(X_val[num_cols])
X_test_num = imputer_num.transform(X_test[num_cols])
scaler_num = StandardScaler()
X_train_num1 = scaler_num.fit_transform(X_train_num)
X_val_num1 = scaler_num.transform(X_val_num)
X_test_num1 = scaler_num.transform(X_test_num)
X_train_num_final = pd.DataFrame(X_train_num1, columns=num_cols)
X_val_num_final = pd.DataFrame(X_val_num1, columns=num_cols)
X_test_num_final = pd.DataFrame(X_test_num1, columns=num_cols)
# Featurization of categorical data
imputer_cat = SimpleImputer(strategy='constant', fill_value='MISSING')
X_train_cat = imputer_cat.fit_transform(X_train[cat_cols])
X_val_cat = imputer_cat.transform(X_val[cat_cols])
X_test_cat = imputer_cat.transform(X_test[cat_cols])
X_train_cat1= pd.DataFrame(X_train_cat, columns=cat_cols)
X_val_cat1= pd.DataFrame(X_val_cat, columns=cat_cols)
X_test_cat1= pd.DataFrame(X_test_cat, columns=cat_cols)
ohe = OneHotEncoder(sparse=False,handle_unknown='ignore')
X_train_cat2 = ohe.fit_transform(X_train_cat1)
X_val_cat2 = ohe.transform(X_val_cat1)
X_test_cat2 = ohe.transform(X_test_cat1)
cat_cols_ohe = list(ohe.get_feature_names(input_features=cat_cols))
X_train_cat_final = pd.DataFrame(X_train_cat2, columns = cat_cols_ohe)
X_val_cat_final = pd.DataFrame(X_val_cat2, columns = cat_cols_ohe)
X_test_cat_final = pd.DataFrame(X_test_cat2, columns = cat_cols_ohe)
# Final complete data
X_train_final = pd.concat([X_train_num_final,X_train_cat_final], axis = 1)
X_val_final = pd.concat([X_val_num_final,X_val_cat_final], axis = 1)
X_test_final = pd.concat([X_test_num_final,X_test_cat_final], axis = 1)
print(X_train_final.shape)
print(X_val_final.shape)
print(X_test_final.shape)
(20286, 298)
(4347, 298)
(4348, 298)
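The manual impute/scale/encode steps above can also be wrapped in a scikit-learn `ColumnTransformer`, which keeps the train/validation/test transforms consistent automatically (fit on train, transform everywhere else). A minimal numeric-only sketch with an invented column:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({'num': [1.0, np.nan, 3.0]})
num_pipe = Pipeline([('impute', SimpleImputer(strategy='median')),
                     ('scale', StandardScaler())])
ct = ColumnTransformer([('num', num_pipe, ['num'])])
out = ct.fit_transform(toy)   # NaN -> median (2.0), then standardized
print(out.ravel())
```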
In [201]:
# Saving the numpy arrays into text files for future use
np.savetxt('y.txt', y)
np.savetxt('y_train.txt', y_train)
np.savetxt('y_val.txt', y_val)
np.savetxt('y_test.txt', y_test)
In [75]:
application_test = pd.read_csv('data/home-credit-default-risk/application_test.csv')
print('testset shape',application_test.shape)
print('train set shape',application_train_m.shape)
testset shape (48744, 121)
train set shape (28981, 175)
In [76]:
def cv_plot(alpha, cv_auc):
    # Plot validation AUC against log10(alpha), annotating each point
    fig, ax = plt.subplots()
    ax.plot(np.log10(alpha), cv_auc, c='g')
    for i, txt in enumerate(np.round(cv_auc, 3)):
        ax.annotate((alpha[i], str(txt)), (np.log10(alpha[i]), cv_auc[i]))
    plt.grid()
    plt.xticks(np.log10(alpha))
    plt.title("Cross Validation AUC for each alpha")
    plt.xlabel("log10(alpha)")
    plt.ylabel("AUC score")
    plt.show()
In [79]:
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
In [74]:
## parameter tuning 
alpha = np.logspace(-4,4,9)
cv_auc_score = []
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l1',class_weight = 'balanced', loss='log', random_state=28)
    clf.fit(X_train_final, y_train)
    sig_clf = CalibratedClassifierCV(clf, method='sigmoid')
    sig_clf.fit(X_train_final, y_train)
    y_pred_prob = sig_clf.predict_proba(X_val_final)[:,1]
    cv_auc_score.append(roc_auc_score(y_val,y_pred_prob))
    print('For alpha {0}, cross validation AUC score {1}'.format(i,roc_auc_score(y_val,y_pred_prob)))
cv_plot(alpha, cv_auc_score)
print('The optimal alpha value is:', alpha[np.argmax(cv_auc_score)])
For alpha 0.0001, cross validation AUC score 0.6621669533874007
For alpha 0.001, cross validation AUC score 0.7091321372982714
For alpha 0.01, cross validation AUC score 0.7452130744654706
For alpha 0.1, cross validation AUC score 0.6593842877037766
For alpha 1.0, cross validation AUC score 0.47421413942819696
For alpha 10.0, cross validation AUC score 0.5
For alpha 100.0, cross validation AUC score 0.5
For alpha 1000.0, cross validation AUC score 0.5
For alpha 10000.0, cross validation AUC score 0.5
The optimal alpha value is: 0.01
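Selecting the best alpha from the grid is just an argmax over the validation AUCs; a sketch reproducing the selection with (rounded) scores like those printed above:

```python
import numpy as np

alpha = np.logspace(-4, 4, 9)   # 1e-4, 1e-3, ..., 1e4
cv_auc = [0.662, 0.709, 0.745, 0.659, 0.474, 0.5, 0.5, 0.5, 0.5]
best_alpha = alpha[int(np.argmax(cv_auc))]   # best AUC at index 2 -> alpha = 0.01
print(best_alpha)
```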
In [80]:
# best_alpha = alpha[np.argmax(cv_auc_score)]
best_alpha=0.01
logreg = SGDClassifier(alpha = best_alpha, class_weight = 'balanced', penalty = 'l1', loss='log', random_state = 28)
logreg.fit(X_train_final, y_train)
logreg_sig_clf = CalibratedClassifierCV(logreg, method='sigmoid')
logreg_sig_clf.fit(X_train_final, y_train)
y_pred_prob = logreg_sig_clf.predict_proba(X_train_final)[:,1]
print('For best alpha {0}, The Train AUC score is {1}'.format(best_alpha, roc_auc_score(y_train,y_pred_prob) ))    
y_pred_prob = logreg_sig_clf.predict_proba(X_val_final)[:,1]
print('For best alpha {0}, The Cross validated AUC score is {1}'.format(best_alpha, roc_auc_score(y_val,y_pred_prob) ))  
y_pred_prob = logreg_sig_clf.predict_proba(X_test_final)[:,1]
print('For best alpha {0}, The Test AUC score is {1}'.format(best_alpha, roc_auc_score(y_test,y_pred_prob) ))
y_pred = logreg.predict(X_test_final)
print('The test AUC score is :', roc_auc_score(y_test,y_pred_prob))
print('The percentage of misclassified points {:05.2f}% :'.format((1-accuracy_score(y_test, y_pred))*100))
For best alpha 0.01, The Train AUC score is 0.7418899552549255
For best alpha 0.01, The Cross validated AUC score is 0.7363645552192489
For best alpha 0.01, The Test AUC score is 0.7201207422981493
The test AUC score is : 0.7201207422981493
The percentage of misclassified points 39.03% :
In [194]:
import statsmodels.discrete.discrete_model as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.regressionplots import *
In [128]:
correlations=X_train_final.corr() 
In [129]:
sns.heatmap(correlations)
Out[129]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5630d23f70>
In [141]:
def corr_df(x, corr_val):
    '''
    Obj: Drops features that are strongly correlated to other features.
          This lowers model complexity and aids in generalizing the model.
    Inputs:
          x: features df
          corr_val: correlation threshold above which columns are dropped (e.g. 0.8)
    Output: df that only includes the remaining, less-correlated features
    '''

    # Creates Correlation Matrix and Instantiates
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterates through Correlation Matrix Table to find correlated columns
    for i in iters:
        for j in range(i):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = item.values
            if val >= corr_val:
                # Prints the correlated feature set and the corr val
                print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(i)

    drops = sorted(set(drop_cols))[::-1]

    # Drops the correlated columns cumulatively (dropping from x inside the
    # loop would discard all but the last drop, since each pass started over)
    df = x.copy()
    for i in drops:
        col = x.iloc[:, (i+1):(i+2)].columns.values
        df = df.drop(col, axis=1)

    return df
In [144]:
x_train_in=corr_df(X_train_final,0.9)
NONLIVINGAPARTMENTS_MODE | NONLIVINGAPARTMENTS_AVG | 0.94
TOTALAREA_MODE | LIVINGAREA_AVG | 0.92
NAME_CONTRACT_STATUS_Demand | SK_DPD_DEF_y | 0.9
EMERGENCYSTATE_MODE_MISSING | HOUSETYPE_MODE_MISSING | 0.95
In [145]:
x_train_in
Out[145]:
CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY_x REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE ... HOUSETYPE_MODE_terraced house WALLSMATERIAL_MODE_Block WALLSMATERIAL_MODE_Mixed WALLSMATERIAL_MODE_Monolithic WALLSMATERIAL_MODE_Others WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone, brick WALLSMATERIAL_MODE_Wooden EMERGENCYSTATE_MODE_MISSING EMERGENCYSTATE_MODE_Yes
0 0.786191 -1.247782 -0.373707 -0.903409 -0.092926 0.481761 -0.455272 0.307872 -1.255557 -0.140394 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1 -0.567112 1.148106 -0.296054 1.415042 1.731974 0.874230 -0.432072 1.001075 -0.914768 -0.140394 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 -0.567112 -1.223823 -1.232636 -1.558341 -0.248690 -1.697412 2.376904 -0.603529 -0.713063 -0.140394 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
3 -0.567112 -0.433180 -0.337928 -0.481180 -0.793406 1.026554 -0.401677 0.223510 -0.987276 -0.140394 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 -0.567112 -0.529015 0.104105 -0.620547 -0.092926 -1.160650 2.376904 0.709666 -0.652420 -0.140394 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20281 -0.567112 -0.049838 0.073769 0.040734 0.645423 1.272398 -0.405769 0.107118 -0.357772 -0.453142 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
20282 -0.567112 -0.073797 -1.115120 -0.892615 -0.693146 0.464145 -0.434575 -0.980729 -0.793481 -0.140394 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
20283 2.139495 -0.816522 -0.888697 -1.044999 0.685752 0.888737 -0.408903 0.989922 1.244662 -0.140394 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
20284 -0.567112 1.866872 3.636662 2.659508 0.086515 -0.175203 -0.404857 0.521495 1.876803 -0.922263 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
20285 -0.567112 0.908517 -0.792565 -0.401496 1.731974 1.668235 -0.415528 0.239238 0.545286 -0.140394 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

20286 rows × 249 columns
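A common vectorized variant of the pruning above uses the upper triangle of the *absolute* correlation matrix; unlike the loop version it also catches strong negative correlations. A sketch on toy data:

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    corr = df.corr().abs()
    # keep only the strictly-upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] >= threshold).any()]
    return df.drop(columns=to_drop)

toy = pd.DataFrame({'a': [1, 2, 3, 4],
                    'b': [2, 4, 6, 8],     # perfectly correlated with 'a'
                    'c': [4, 1, 3, 2]})    # weakly correlated with 'a'
pruned = drop_correlated(toy, 0.9)
print(list(pruned.columns))  # ['a', 'c']
```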

In [170]:
print('target shape',y_train.shape)
y_train
target shape (20286,)
Out[170]:
array([0, 0, 0, ..., 0, 0, 0])
In [181]:
print('target shape',y_test.shape)
y_test
target shape (4348,)
Out[181]:
array([0, 0, 0, ..., 0, 0, 0])
In [ ]:
 

According to the explanatory variable table and the correlations calculated above, we select the explanatory variables used to fit the model on the training data.

In [177]:
# x_train_in = pd.DataFrame(X_train_final, columns=['AMT_INCOME_TOTAL','AMT_CREDIT_MAX_OVERDUE','clos_active_prop','cons_cred_card_prop','paid_on_time']) 
x_train_in = pd.DataFrame(X_train_final, columns=['AMT_INCOME_TOTAL','AMT_CREDIT_MAX_OVERDUE','cons_cred_card_prop','paid_on_time']) 
In [178]:
logit = sm.Logit(y_train, x_train_in.fillna(0)).fit()
Optimization terminated successfully.
         Current function value: 0.692721
         Iterations 4
In [179]:
logit.summary()
Out[179]:
Logit Regression Results
Dep. Variable: y No. Observations: 20286
Model: Logit Df Residuals: 20282
Method: MLE Df Model: 3
Date: Tue, 08 Feb 2022 Pseudo R-squ.: -1.388
Time: 12:50:40 Log-Likelihood: -14053.
converged: True LL-Null: -5884.5
Covariance Type: nonrobust LLR p-value: 1.000
coef std err z P>|z| [0.025 0.975]
AMT_INCOME_TOTAL -0.0134 0.014 -0.946 0.344 -0.041 0.014
AMT_CREDIT_MAX_OVERDUE 0.0104 0.016 0.652 0.515 -0.021 0.042
cons_cred_card_prop 0.0122 0.014 0.863 0.388 -0.016 0.040
paid_on_time -0.0541 0.014 -3.836 0.000 -0.082 -0.026
In [176]:
x_train_in2 = pd.DataFrame(X_train_final, columns=['AMT_INCOME_TOTAL','AMT_CREDIT_MAX_OVERDUE','cons_cred_card_prop','paid_on_time','FLOORSMIN_MEDI','EXT_SOURCE_#'])

logit = sm.Logit(y_train, x_train_in2.fillna(0)).fit()
Optimization terminated successfully.
         Current function value: 0.692721
         Iterations 4
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-176-a7ebdca0d4fc> in <module>
      1 x_train_in2 = pd.DataFrame(X_train_final, columns=['AMT_INCOME_TOTAL','AMT_CREDIT_MAX_OVERDUE','cons_cred_card_prop','paid_on_time','FLOORSMIN_MEDI','EXT_SOURCE_#'])
      2 
----> 3 logit = sm.Logit(y_train, x_train_in2.fillna(0)).fit()

~/.local/lib/python3.8/site-packages/statsmodels/discrete/discrete_model.py in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
   1981     def fit(self, start_params=None, method='newton', maxiter=35,
   1982             full_output=1, disp=1, callback=None, **kwargs):
-> 1983         bnryfit = super().fit(start_params=start_params,
   1984                               method=method,
   1985                               maxiter=maxiter,

~/.local/lib/python3.8/site-packages/statsmodels/discrete/discrete_model.py in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
    228             pass  # TODO: make a function factory to have multiple call-backs
    229 
--> 230         mlefit = super().fit(start_params=start_params,
    231                              method=method,
    232                              maxiter=maxiter,

~/.local/lib/python3.8/site-packages/statsmodels/base/model.py in fit(self, start_params, method, maxiter, full_output, disp, fargs, callback, retall, skip_hessian, **kwargs)
    577             Hinv = cov_params_func(self, xopt, retvals)
    578         elif method == 'newton' and full_output:
--> 579             Hinv = np.linalg.inv(-retvals['Hessian']) / nobs
    580         elif not skip_hessian:
    581             H = -1 * self.hessian(xopt)

<__array_function__ internals> in inv(*args, **kwargs)

/usr/lib/python3/dist-packages/numpy/linalg/linalg.py in inv(a)
    549     signature = 'D->D' if isComplexType(t) else 'd->d'
    550     extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 551     ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
    552     return wrap(ainv.astype(result_t, copy=False))
    553 

/usr/lib/python3/dist-packages/numpy/linalg/linalg.py in _raise_linalgerror_singular(err, flag)
     95 
     96 def _raise_linalgerror_singular(err, flag):
---> 97     raise LinAlgError("Singular matrix")
     98 
     99 def _raise_linalgerror_nonposdef(err, flag):

LinAlgError: Singular matrix
In [217]:
x_train_in3= pd.DataFrame(X_train_final, columns=['AMT_INCOME_TOTAL','AMT_CREDIT_MAX_OVERDUE','cons_cred_card_prop','paid_on_time','EXT_SOURCE_3'])

logit = sm.Logit(y_train, x_train_in3.fillna(0)).fit()
Optimization terminated successfully.
         Current function value: 0.688080
         Iterations 4
In [187]:
x_test_fianl = pd.DataFrame(X_test_final, columns=['AMT_INCOME_TOTAL','AMT_CREDIT_MAX_OVERDUE','cons_cred_card_prop','paid_on_time'])
x_test_fianl
Out[187]:
AMT_INCOME_TOTAL AMT_CREDIT_MAX_OVERDUE cons_cred_card_prop paid_on_time
0 0.045998 -0.034460 0.286603 -4.178137
1 -0.385262 0.008727 0.286603 0.672084
2 -0.289427 -0.040802 0.286603 0.672084
3 -1.199864 -0.040802 0.286603 0.094337
4 1.897875 -0.040802 -7.262057 0.672084
... ... ... ... ...
4343 -0.768604 0.056143 0.286603 0.267066
4344 0.189751 0.093630 -0.185188 0.672084
4345 -0.912358 -0.029352 0.286603 0.672084
4346 0.908517 -0.040802 0.286603 0.672084
4347 0.812682 -0.040802 0.286603 0.672084

4348 rows × 4 columns

In [188]:
len(logit.predict(x_test_fianl))
Out[188]:
4348
In [189]:
# Label as 1 if the predicted default probability is above 50%, and 0 otherwise
predict_label = pd.DataFrame(np.zeros(shape=(len(x_test_fianl), 1)), columns=['label'])
predict_label.loc[logit.predict(x_test_fianl) > 0.5] = 1
predict_label
Out[189]:
label
0 1.0
1 0.0
2 0.0
3 1.0
4 0.0
... ...
4343 0.0
4344 0.0
4345 0.0
4346 0.0
4347 0.0

4348 rows × 1 columns

  • 2.2 Interpreting the model [5 Marks]

Interpret the estimated coefficients of the logistic regression model. Explain their impact on the response variable in plain English.

In [193]:
### YOUR CODE HERE ###
# Note: this explanation can be quite long
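One way to make the interpretation concrete is to exponentiate the coefficients into odds ratios. A minimal sketch using the coefficient values copied from the Logit summary table above (hard-coded here so the block stands alone):

```python
import numpy as np
import pandas as pd

# Coefficients copied from the fitted Logit summary above
coefs = pd.Series({
    'AMT_INCOME_TOTAL':       -0.0134,
    'AMT_CREDIT_MAX_OVERDUE':  0.0104,
    'cons_cred_card_prop':     0.0122,
    'paid_on_time':           -0.0541,
})

# exp(beta) is the multiplicative change in the odds of default for a
# one-unit increase in the (standardized) explanatory variable
odds_ratios = np.exp(coefs)
print(odds_ratios)
# e.g. exp(-0.0541) ≈ 0.947: a one-standard-deviation increase in
# paid_on_time multiplies the odds of default by about 0.95 (a ~5% drop)
```

Odds ratios above 1 (e.g. AMT_CREDIT_MAX_OVERDUE) increase the odds of default; those below 1 (income, on-time payment history) decrease it.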

Part 3: Model Evaluation [25 Marks]

  • 3.1 Confusion matrix [5 Marks]

Calculate the confusion matrix using the testing set.

In [190]:
### YOUR CODE HERE ###
# Look into the function and corresponding parameters: confusion_matrix.
# Note: this local definition shadows sklearn's plot_confusion_matrix imported above.
def plot_confusion_matrix(test_y, predicted_y):
    # Confusion matrix
    C = confusion_matrix(test_y, predicted_y)

    # Recall matrix: each row (true class) normalized to sum to 1
    A = ((C.T) / C.sum(axis=1)).T

    # Precision matrix: each column (predicted class) normalized to sum to 1
    B = C / C.sum(axis=0)

    plt.figure(figsize=(20, 4))

    labels = ['Re-paid(0)', 'Not Re-paid(1)']
    cmap = sns.light_palette("purple")
    plt.subplot(1, 3, 1)
    sns.heatmap(C, annot=True, cmap=cmap, fmt="d", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title('Confusion matrix')

    plt.subplot(1, 3, 2)
    sns.heatmap(A, annot=True, cmap=cmap, xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title('Recall matrix')

    plt.subplot(1, 3, 3)
    sns.heatmap(B, annot=True, cmap=cmap, xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title('Precision matrix')

    plt.show()
In [191]:
plot_confusion_matrix(y_test, predict_label['label'])
In [79]:
plot_confusion_matrix(y_test, y_pred)
  • 3.2 Explaining the confusion matrix [5 Marks]

Explain a weakness or a strength of the model based on the confusion matrix.

In [80]:
### YOUR EXPLANATION HERE ###
  • 3.3 Graphic illustration of the performance [5 Marks]

Plot the ROC curve (receiver operating characteristic curve) showing the performance of the model, and calculate the AUC (area under the curve).

In [81]:
### YOUR CODE HERE ###
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
auc_value = roc_auc_score(y_test, y_pred_prob)  # avoid shadowing sklearn's auc() imported above
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, marker='.')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.title('ROC curve', fontsize=20)
plt.xlabel('FPR', fontsize=15)
plt.ylabel('TPR', fontsize=15)
plt.grid()
plt.legend(["AUC=%.3f" % auc_value])
plt.show()
In [192]:
### YOUR CODE HERE ###
# Use the predicted probabilities (not the thresholded 0/1 labels) so the ROC curve is meaningful
from sklearn.metrics import roc_curve
y_pred_prob_lr = logit.predict(x_test_fianl)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob_lr)
auc_value = roc_auc_score(y_test, y_pred_prob_lr)
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, marker='.')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.title('ROC curve', fontsize=20)
plt.xlabel('FPR', fontsize=15)
plt.ylabel('TPR', fontsize=15)
plt.grid()
plt.legend(["AUC=%.3f" % auc_value])
plt.show()
  • 3.4 Compute the performance measures [5 Marks]

Compute the following measures to assess the performance of the logistic regression model, using the testing set: precision, recall, F_1-score, accuracy, and total misclassification rate.

In [82]:
### YOUR CODE HERE ###
#Given the confusion matrix result, note that this can be computed manually using formula as well
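As the comment above notes, these measures can be computed by hand from the confusion-matrix entries. A minimal sketch on a small synthetic example so the formulas are explicit; in the notebook, `y_true`/`y_hat` would be replaced by `y_test` and the predicted labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic stand-ins for y_test and the model's predicted labels
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat  = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels {0, 1}
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)                        # of predicted defaults, how many were real
recall    = tp / (tp + fn)                        # of real defaults, how many were caught
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / (tp + tn + fp + fn)
misclass  = (fp + fn) / (tp + tn + fp + fn)       # total misclassification rate
print(precision, recall, f1, accuracy, misclass)
# → 0.75 0.75 0.75 0.8 0.2
```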
  • 3.5 Suitable performance measures for reporting [5 Marks]

If you were to only choose two of the five performance measures for reporting, which ones do you choose? Justify your choice.

In [83]:
### YOUR EXPLANATION HERE ###

Part 4: Alternative classification model [20 marks]

Select an alternative classification model that has at least one tunable hyper-parameter.

HINT!

You might want to try a random forest or a boosted tree, for example.

  • 4.1 Train the alternative classification model [5 marks]

Train your alternative model on the same training set.

Selection of features

In [210]:
import re
# Strip special characters from feature names (LightGBM rejects non-alphanumeric characters in names)
train_features, valid_features, train_y, valid_y = train_test_split(X_train_final, y_train, test_size = 0.15, random_state = 42)
train_features = train_features.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
valid_features = valid_features.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
X_train_final = X_train_final.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
In [211]:
import pickle
import lightgbm as lgb
model_sk = lgb.LGBMClassifier(boosting_type='gbdt', max_depth=7, learning_rate=0.01, n_estimators= 2000, 
                 class_weight='balanced', subsample=0.9, colsample_bytree= 0.8, n_jobs=-1)
model_sk.fit(train_features, train_y, early_stopping_rounds=100, eval_set = [(valid_features, valid_y)], eval_metric = 'auc', verbose = 200)
feature_imp = pd.DataFrame(sorted(zip(model_sk.feature_importances_, X_train_final.columns)), columns=['Value','Feature'])
features_df = feature_imp.sort_values(by="Value", ascending=False)
selected_features = list(features_df[features_df['Value']>=50]['Feature'])
# Saving the selected features into pickle file
with open('select_features.txt','wb') as fp:
    pickle.dump(selected_features, fp)
print('The no. of features selected:',len(selected_features))
[200]	valid_0's auc: 0.725935	valid_0's binary_logloss: 0.546305
[400]	valid_0's auc: 0.732838	valid_0's binary_logloss: 0.495595
The no. of features selected: 74
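The pickled feature list can be read back later with the mirror-image call. A small round-trip sketch (the filename `select_features_demo.txt` and the toy list are stand-ins for the notebook's `select_features.txt` and `selected_features`):

```python
import pickle

# Toy stand-in for the selected_features list pickled above
features = ['EXT_SOURCE_3', 'EXT_SOURCE_2', 'paid_on_time']
with open('select_features_demo.txt', 'wb') as fp:
    pickle.dump(features, fp)

# Load it back for reuse in a later session
with open('select_features_demo.txt', 'rb') as fp:
    loaded = pickle.load(fp)

print(loaded)  # → ['EXT_SOURCE_3', 'EXT_SOURCE_2', 'paid_on_time']
```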
In [213]:
# Feature importance Plot
import plotly.offline as py
import plotly.graph_objs as go

data1 = features_df.head(20)
data = [go.Bar(x =data1.sort_values(by='Value')['Value'] , y = data1.sort_values(by='Value')['Feature'], orientation = 'h',
              marker = dict(
        color = 'rgba(43, 13, 150, 0.6)',
        line = dict(
            color = 'rgba(43, 13, 150, 1.0)',
            width = 1.5)
    )) ]
layout = go.Layout(
    autosize=False,
    width=1300,
    height=700,
    title = "Top 20 important features",
    xaxis=dict(
        title='Importance value'
        ),
    yaxis=dict(
        automargin=True
        ),
    bargap=0.4
    )
fig = go.Figure(data = data, layout=layout)
fig.layout.template = 'seaborn'
py.iplot(fig)
In [216]:
data1
Out[216]:
Value Feature
297 816 EXT_SOURCE_3
296 701 EXT_SOURCE_2
295 567 EXT_SOURCE_1
294 448 CNT_INSTALMENT_FUTURE
293 386 DAYS_BIRTH
292 365 DAYS_LAST_PHONE_CHANGE
291 328 MONTHS_BALANCE_x
290 328 CNT_DRAWINGS_ATM_CURRENT
289 324 paid_on_time
288 304 AMT_CREDIT
287 288 DAYS_EMPLOYED
286 286 CNT_DRAWINGS_CURRENT
285 273 NUM_INSTALMENT_VERSION
284 260 AMT_CREDIT_LIMIT_ACTUAL
283 256 SK_ID_PREV
282 255 AMT_CREDIT_SUM
281 251 SK_ID_BUREAU
280 246 AMT_ANNUITY_x
279 232 CNT_INSTALMENT
278 232 AMT_CREDIT_SUM_DEBT

Random forest

In [197]:
from sklearn.calibration import CalibratedClassifierCV  # needed for the probability calibration below

alpha = [200, 500, 1000, 2000]
max_depth = [7, 10]
cv_auc_score = []
for i in alpha:
    for j in max_depth:
        clf = RandomForestClassifier(n_estimators=i, criterion='gini', max_depth=j, class_weight='balanced',
                                     random_state=42, n_jobs=-1)
        clf.fit(X_train_final, y_train)
        sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
        sig_clf.fit(X_train_final, y_train)
        y_pred_prob = sig_clf.predict_proba(X_val_final)[:, 1]
        cv_auc_score.append(roc_auc_score(y_val, y_pred_prob))
        print('For n_estimators {0}, max_depth {1} cross validation AUC score {2}'.
              format(i, j, roc_auc_score(y_val, y_pred_prob)))
For n_estimators 200, max_depth 7 cross validation AUC score 0.7112783690461881
For n_estimators 200, max_depth 10 cross validation AUC score 0.7094945474611275
For n_estimators 500, max_depth 7 cross validation AUC score 0.7111527093326923
For n_estimators 500, max_depth 10 cross validation AUC score 0.7113685162319568
For n_estimators 1000, max_depth 7 cross validation AUC score 0.712201011833867
For n_estimators 1000, max_depth 10 cross validation AUC score 0.7118028617633881
For n_estimators 2000, max_depth 7 cross validation AUC score 0.7122201339641814
For n_estimators 2000, max_depth 10 cross validation AUC score 0.7119640340046112
In [ ]:
# Decode the flat argmax index back into the (n_estimators, max_depth) grid
best_alpha = np.argmax(cv_auc_score)
best_n, best_d = alpha[best_alpha // 2], max_depth[best_alpha % 2]
print('The optimal values are: n_estimators {0}, max_depth {1}'.format(best_n, best_d))
rf = RandomForestClassifier(n_estimators=best_n, criterion='gini', max_depth=best_d,
                            class_weight='balanced', random_state=42, n_jobs=-1)
rf.fit(X_train_final, y_train)
rf_sig_clf = CalibratedClassifierCV(rf, method="sigmoid")
rf_sig_clf.fit(X_train_final, y_train)
y_pred_prob = rf_sig_clf.predict_proba(X_train_final)[:, 1]
print('For best n_estimators {0}, max_depth {1}, the Train AUC score is {2}'.format(
    best_n, best_d, roc_auc_score(y_train, y_pred_prob)))
y_pred_prob = rf_sig_clf.predict_proba(X_val_final)[:, 1]
print('For best n_estimators {0}, max_depth {1}, the Validation AUC score is {2}'.format(
    best_n, best_d, roc_auc_score(y_val, y_pred_prob)))
y_pred_prob = rf_sig_clf.predict_proba(X_test_final)[:, 1]
print('For best n_estimators {0}, max_depth {1}, the Test AUC score is {2}'.format(
    best_n, best_d, roc_auc_score(y_test, y_pred_prob)))
y_pred = rf_sig_clf.predict(X_test_final)
print('The percentage of misclassified points {:05.2f}% :'.format((1 - accuracy_score(y_test, y_pred)) * 100))
plot_confusion_matrix(y_test, y_pred)
The optimal values are: n_estimators 1000, max_depth 10 
  • 4.2 Tune the alternative classification model [5 Marks]

Use k-fold cross-validation to tune the hyper-parameters of the alternative model based on accuracy.

In [ ]:
### YOUR CODE HERE ###
cv_auc_score = []
max_depth = [3, 5, 7, 10]
for i in max_depth:
    
    params = {'boosting_type': 'gbdt',
          'max_depth' : i,
          'objective': 'binary',
          'nthread': 5,
          'num_leaves': 32,
          'learning_rate': 0.05,
          'max_bin': 512,
          'subsample_for_bin': 200,
          'subsample': 0.7,
          'subsample_freq': 1,
          'colsample_bytree': 0.8,
          'reg_alpha': 20,
          'reg_lambda': 20,
          'min_split_gain': 0.5,
          'min_child_weight': 1,
          'min_child_samples': 10,
          'scale_pos_weight': 1,
          'num_class' : 1,
          'metric' : 'auc'
          }

    lgbm = lgb.train(params,
                    train_data,
                    2500,
                    valid_sets=valid_data,
                    early_stopping_rounds= 100,
                    verbose_eval= 10
                    )
    y_pred_prob = lgbm.predict(X_val_final)
    cv_auc_score.append(roc_auc_score(y_val, y_pred_prob))
    print('For max_depth {0} (other parameters fixed), cross validation AUC score {1}'.format(i, roc_auc_score(y_val, y_pred_prob)))
# Report the optimum once, after the loop has tried every depth
print('The optimal max_depth:', max_depth[np.argmax(cv_auc_score)])
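The loop above still tunes on a single validation split, whereas the question asks for k-fold cross-validation based on accuracy. A minimal sketch of that pattern with scikit-learn's `GridSearchCV` (a random forest and synthetic data stand in for the notebook's model and `X_train_final`/`y_train`, so the block runs standalone):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for X_train_final / y_train
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

param_grid = {'max_depth': [3, 5, 7, 10]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# scoring='accuracy' makes the tuning criterion accuracy, as the question asks
search = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=42),
                      param_grid, scoring='accuracy', cv=cv, n_jobs=-1)
search.fit(X, y)
print('Best max_depth:', search.best_params_['max_depth'])
print('Best 5-fold CV accuracy: {:.3f}'.format(search.best_score_))
```

`GridSearchCV` refits the best configuration on the full training data, so `search.best_estimator_` can be used directly on the test set afterwards.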
  • 4.3 Confusion matrix [5 Marks]

Calculate the confusion matrix using the testing set.

In [ ]:
### YOUR CODE HERE ###
params = {'boosting_type': 'gbdt',
          'max_depth' : max_depth[np.argmax(cv_auc_score)],
          'objective': 'binary',
          'nthread': 5,
          'num_leaves': 32,
          'learning_rate': 0.05,
          'max_bin': 512,
          'subsample_for_bin': 200,
          'subsample': 0.7,
          'subsample_freq': 1,
          'colsample_bytree': 0.8,
          'reg_alpha': 20,
          'reg_lambda': 20,
          'min_split_gain': 0.5,
          'min_child_weight': 1,
          'min_child_samples': 10,
          'scale_pos_weight': 1,
          'num_class' : 1,
          'metric' : 'auc'
          }
lgbm = lgb.train(params,
                 train_data,
                 2500,
                 valid_sets=valid_data,
                 early_stopping_rounds= 100,
                 verbose_eval= 10
                 )
y_pred_prob = lgbm.predict(X_train_final)

print('For best max_depth {0}, The Train AUC score is {1}'.format(max_depth[np.argmax(cv_auc_score)], 
                                                                  roc_auc_score(y_train,y_pred_prob) ))    
y_pred_prob = lgbm.predict(X_val_final)
print('For best max_depth {0}, The Cross validated AUC score is {1}'.format(max_depth[np.argmax(cv_auc_score)], 
                                                                            roc_auc_score(y_val,y_pred_prob) ))  
y_pred_prob = lgbm.predict(X_test_final)
print('For best max_depth {0}, The Test AUC score is {1}'.format(max_depth[np.argmax(cv_auc_score)], 
                                                                 roc_auc_score(y_test,y_pred_prob) ))
# Threshold the predicted probabilities at 0.5 to get hard labels
y_pred = (y_pred_prob > 0.5).astype(int)
print('The test AUC score is :', roc_auc_score(y_test,y_pred_prob))
print('The percentage of misclassified points {:05.2f}% :'.format((1-accuracy_score(y_test, y_pred))*100))
plot_confusion_matrix(y_test, y_pred)
  • 4.4 Graphic illustration of the performance [5 Marks]

Plot the ROC curve (receiver operating characteristic curve) showing the performance of the model, and calculate the AUC (area under the curve).

In [ ]:
### YOUR CODE HERE ###
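The ROC plot for the tuned model follows the same pattern as in 3.3, using the model's predicted probabilities on the test set. A sketch with synthetic stand-ins for `y_test` and `lgbm.predict(X_test_final)` so it runs standalone:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic stand-ins for y_test and the tuned model's predicted probabilities
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=200)
y_score = 0.3 * y_true + 0.7 * rng.random(200)  # scores loosely correlated with the labels

fpr, tpr, _ = roc_curve(y_true, y_score)
auc_value = roc_auc_score(y_true, y_score)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, marker='.')
plt.plot([0, 1], [0, 1], linestyle='--')  # chance diagonal
plt.title('ROC curve - tuned alternative model', fontsize=20)
plt.xlabel('FPR', fontsize=15)
plt.ylabel('TPR', fontsize=15)
plt.grid()
plt.legend(['AUC=%.3f' % auc_value])
plt.show()
```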

Part 5: Compare the two models [15 Marks]

Use the test data to compare the logistic regression model and the tuned alternative model.

  • 5.1 Compare predictive performance [5 Marks]

Choose a suitable plot and illustrate all the five performance measures (precision, recall, F_1-score, accuracy, and total misclassification rate) for the two models. Use appropriate figure size, title/caption, legend, and axis titles.

In [ ]:
### YOUR CODE HERE ###
#bar plot
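As the comment suggests, a grouped bar chart with one bar per measure per model works well here. A sketch with placeholder scores (hypothetical values, to be replaced by the measures actually computed in Parts 3 and 4):

```python
import numpy as np
import matplotlib.pyplot as plt

measures = ['Precision', 'Recall', 'F1-score', 'Accuracy', 'Misclassification']
# Placeholder values -- substitute the test-set measures computed earlier
logreg_scores = [0.15, 0.55, 0.24, 0.65, 0.35]
lgbm_scores   = [0.20, 0.60, 0.30, 0.70, 0.30]

x = np.arange(len(measures))
width = 0.35  # width of each bar in the pair

fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(x - width / 2, logreg_scores, width, label='Logistic regression')
ax.bar(x + width / 2, lgbm_scores, width, label='Tuned alternative model')
ax.set_xticks(x)
ax.set_xticklabels(measures)
ax.set_ylabel('Score')
ax.set_title('Performance comparison on the test set')
ax.legend()
plt.show()
```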
  • 5.2 Recommend model to client (scenario 1) [5 Marks]

Among the two models you have developed, which model do you suggest to Home Credit Group (the client of this project) if they are more concerned about rejecting applications of individuals who may otherwise end up having difficulty in repayment? Provide a numeric measure for each of the two models to justify your response.

In [ ]:
### YOUR EXPLANATION HERE ###
  • 5.3 Recommend model to client (scenario 2) [5 Marks]

Among the two models you have developed, which model do you suggest to Home Credit Group (the client of this project) if they are more concerned about not rejecting applications of individuals who may be capable of repayment according to the data? Provide a numeric measure for each of the two models to justify your response.

In [ ]:
### YOUR EXPLANATION HERE ###